Chapter 14 - Query Processor

This chapter describes the new features of the Microsoft SQL Server query processor. The query processor is that portion of SQL Server that accepts, parses, and executes SQL syntax. This chapter includes these topics:

  • Query processor goals 

  • Query execution 

  • Query optimization 

  • Distributed queries 

  • Working with queries 

The query processor is closely related to the storage engine, another part of SQL Server. For more information, see "Storage Engine" later in this volume.

What Is a Query Processor?

A relational database consists of many parts, but at its heart are two major components: the storage engine and the query processor. The storage engine writes data to and reads data from the disk. It manages records, controls concurrency, and maintains log files.

The query processor accepts SQL syntax, selects a plan for executing the syntax, and then executes the chosen plan. The user or program interacts with the query processor, and the query processor in turn interacts with the storage engine. The query processor isolates the user from the details of execution: The user specifies the result, and the query processor determines how this result is obtained.

Query Processing Phases

There are two major phases in query processing: query optimization and query execution.

Query optimization is the process of choosing the fastest execution plan. In the optimization phase, the query processor chooses:

  • Which, if any, indexes to use. 

  • The order in which joins are executed. 

  • The order in which constraints such as WHERE clauses are applied. 

  • Which algorithms are likely to lead to the best performance, based on cost information derived from statistics. 

Query execution is the process of executing the plan chosen during query optimization. The query execution component also determines the techniques available to the query optimizer. For example, SQL Server implements a hash join algorithm and a merge join algorithm, both of which are available to the query optimizer.

The query optimizer is the brain of a relational database system, enabling it to work intelligently and efficiently.

A relational database with a sophisticated query optimizer is more likely to complete a query, especially a complex query, faster than a relational database with a simple query optimizer.

Types of Query Optimizers

There are two major types of query optimizers in relational databases: syntax-based and cost-based.

Syntax-based Query Optimizers

A syntax-based query optimizer creates a procedural plan for obtaining the answer to an SQL query, but the particular plan it chooses is dependent on the exact syntax of the query and the order of the clauses within the query. A syntax-based query optimizer executes the same plan every time, regardless of whether the number or composition of records in the database changes over time. Unlike a cost-based query optimizer, it neither maintains nor considers statistics about the database.

Cost-based Query Optimizers

A cost-based query optimizer chooses among alternative plans to answer an SQL query. Selection is based on cost estimates for different plans. The factors in making cost estimates include the number of I/O operations, the amount of CPU time, and so on. A cost-based query optimizer estimates these costs by keeping statistics about the number and composition of records in a table or index and is not dependent on the exact syntax of the query or the order of clauses within the query (unlike a syntax-based query optimizer).
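
For example, the two queries below request the same result but order their tables and predicates differently. A syntax-based query optimizer may execute them with different plans, while a cost-based query optimizer chooses a single plan based on its statistics. (The table and column names are illustrative only.)

SELECT o_orderkey
FROM orders, customer
WHERE o_custkey = c_custkey and c_name = 'ACME'

SELECT o_orderkey
FROM customer, orders
WHERE c_name = 'ACME' and o_custkey = c_custkey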

Query Processor Goals

One of the goals for SQL Server 7.0 is to provide improved query processor support for a range of database activities, including large queries, complex queries, data warehousing, and online analytical processing (OLAP). These are some of the specific ways in which this goal is addressed by SQL Server:

  • More Choices for the Query Optimizer 

    Earlier versions of SQL Server provided limited ways of optimizing queries. For example, SQL Server 6.5 supports only one method for performing a join: nested loops iteration. SQL Server 7.0 adds hash join and merge join, which give the optimizer more options to choose from and are the algorithms of choice for many large queries. 

  • Improved Query Execution 

    SQL Server 7.0 improves execution of a plan after it is chosen. Faster scans, sort improvements, and large memory support all offer potential performance advantages. 

  • Parallel Execution of Queries 

    Symmetric multiprocessing (SMP) computers and striped disk sets are increasingly common. SQL Server 6.5 implemented parallel I/O and inter-query parallelism (assigning different queries to different processors), but it could not execute different parts of a single query in parallel. SQL Server 7.0 breaks a single query into multiple subtasks and distributes them across multiple processors for parallel execution. 

  • Optimized Data Warehouse Queries 

    Star schemas and star queries are common in data warehousing applications. A star query joins a large central table, called a fact table, to one or more smaller tables, called dimension tables. SQL Server 7.0 recognizes these queries automatically and makes a cost-based choice among multiple types of star join plans. 

Query Execution

Query execution is the process of executing the plan chosen during query optimization. The objective is to execute the plan quickly by returning the answer to the user (or more often, the program run by the user) in the least amount of time. This is not the same as executing the plan with the fewest resources (CPU, I/O, and memory). For example, a parallel query almost always uses more resources than a nonparallel query, but it is often desirable because it returns the result more quickly.

Query execution is presented before query optimization because the set of available execution techniques determines the set of choices available to the optimizer. The techniques include disk I/O, sorting, join and hash operations, index intersections, index joins, and parallelism.

Disk I/O

The foundation for efficient query processing is efficient data transfer between disks and memory. SQL Server 7.0 incorporates many improvements in disk I/O.

Sequential I/O vs. Random I/O

Disk I/O is one of the more expensive computer operations. There are two types of disk I/O: sequential I/O, which reads data in the same order as it is stored on the disk, and random I/O, which reads data in random order, jumping from one location to another on the disk. Random I/O can be more expensive than sequential I/O, particularly when large amounts of data are involved.

Microsoft SQL Server 7.0 maintains an on-disk structure that minimizes random I/O and allows rapid scans of large heap tables. These are tables without a clustered index, meaning that the data rows are not stored in any particular order. This is an important feature for decision support queries. Such disk-order scans can also be employed for clustered and nonclustered indexes if the index's sort order is not required in subsequent processing steps.

Earlier versions of SQL Server use a page chain, in which each page has a pointer to the next page holding data for the table. This results in random I/O and prevents read-ahead because, until the server reads a page, it does not have the location of the next page.

SQL Server 7.0 takes a different approach. An Index Allocation Map (IAM) maps the pages used by a table or index. The IAM is a bitmap, in disk order, of the data pages for a particular table. To read all the pages, the server scans the bitmap, determining which pages need to be read in what order. It then can use sequential I/O to retrieve the pages and issue read-aheads.

If the server can scan an index rather than reading the table, it will attempt to do so. This is useful if the index is a covering index (one that has all the fields necessary to satisfy the query). The server may satisfy a query by reading the B-tree index in disk order rather than in sort order. This results in sequential I/O and faster performance.
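
For example, given the hypothetical index below, the query can be answered entirely from the index, so the server can scan the much smaller B-tree in disk order instead of reading the table.

CREATE INDEX idx_orders_cust_date ON orders (o_custkey, o_orderdate)

SELECT o_custkey, o_orderdate -- both columns are in the index; the base table is never read
FROM orders
WHERE o_custkey between 100 and 200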

Large I/O

In SQL Server 7.0, all database pages are 8 KB and data is read in 64-KB extents (in earlier releases these figures were 2 KB and 16 KB, respectively). Both of these changes increase performance by allowing the server to read larger amounts of data in a single I/O request. This is particularly important for very large databases and decision support queries, in which a single request can process large numbers of rows.

Scanning Read-Ahead

SQL Server 7.0 takes increased advantage of striped disk sets by reading multiple extents ahead of the actual query processor request. This results in faster scans of heap tables and B-tree indexes.

Prefetch Hints

Disk-order scans speed up scans of large amounts of data. SQL Server 7.0 also speeds up fetching data using a nonclustered index.

When searching for data using a nonclustered index, the index is searched for a particular value. When that value is found, the index entry points to the disk address of the row. The traditional approach is to immediately issue an I/O for that row, given the disk address. The result is one synchronous I/O per row and, at most, one disk at a time working to evaluate the query. This does not take advantage of striped disk sets.

SQL Server 7.0 takes a different approach. It continues looking for more record pointers in the nonclustered index. When it has collected a number of them, it provides the storage engine with prefetch hints. These hints tell the storage engine that the query processor needs these particular records soon. The storage engine then can issue several I/O requests simultaneously, taking advantage of striped disk sets.

Sort Improvements

Many different areas of the query processor rely on the sort algorithms: merge joins, index creations, stream aggregations, and so on. Sort performance is dramatically improved in SQL Server 7.0.

Many internal improvements make each sort operation faster: simpler comparisons, larger I/O operations, asynchronous I/O, and large memory. In addition, SQL Server 7.0 pipelines data between a sort operation and the query operations on its input and output sides, avoiding the separate query plan phases that earlier versions of SQL Server used, which required writing and scanning intermediate work tables.

Merge Joins, Hash Joins, and Hash Teams

SQL Server 6.5 uses nested loops iteration, which is excellent for row-to-row navigation such as moving from an order record to three or four order-line items. However, it is not efficient for joins of many records, such as typical data warehouse queries.

SQL Server 7.0 introduces three new techniques: merge joins, hash joins, and hash teams, the last of which is a significant innovation not available in any other relational database.

Merge Joins

A merge join simultaneously passes over two sorted inputs to perform inner joins, outer joins, semi-joins, intersections, and unions. A merge join exploits sorted scans of B-tree indexes and is generally the method of choice if the join fields are indexed and if the columns represented in the index cover the query.
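
As a sketch, assuming B-tree indexes on both o_orderkey and l_orderkey, a query such as this one is a natural merge join candidate because both inputs can be scanned in sorted order directly from the indexes (whether the optimizer actually picks a merge join depends on its cost estimates):

SELECT o_orderdate, l_quantity
FROM orders, lineitem
WHERE o_orderkey = l_orderkey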

Hash Joins

A hash join hashes input values, based on a repeatable randomizing function, and compares values in the hash table for matches. For inputs smaller than the available memory, the hash table remains in memory; for larger inputs, overflow files on disk are employed. Hashing is the method of choice for large, nonindexed tables, particularly for intermediate results.

The hashing operation can be used to process GROUP BY clauses, DISTINCT operations, intersections, unions, differences, inner joins, outer joins, and semi-joins. SQL Server 7.0 implements all the well-known hashing techniques, including cache-optimized in-memory hashing, large memory, recursive partitioning, hybrid hashing, bit-vector filtering, and role reversal.
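
For example, a grouped aggregation over a large table with no useful index on the grouping column is a natural candidate for hash processing (a sketch; the optimizer makes the choice based on cost):

SELECT l_suppkey, sum (l_quantity)
FROM lineitem
GROUP BY l_suppkey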

Hash Teams

The hash team is an innovation in SQL Server 7.0. Many queries consist of multiple execution phases; where possible, the query optimizer should take advantage of similar operations across multiple phases. For example, suppose you want to know how many order-line items have been entered for each part number and each supplier, as shown in this SQL code:

SELECT l_partkey, count (*)
FROM lineitem, part, partsupp
WHERE l_partkey = p_partkey and p_partkey = ps_partkey 
GROUP BY l_partkey

In response to this code, the query processor generates this execution plan.

[Figure: execution plan using merge joins and a stream aggregate, with no sort operation]

This query plan performs a merge inner join between the lineitem table and the partsupp table. It calculates the count (stream aggregate), and then joins the result with the part table. This query never requires a sort operation. It begins by retrieving records in sorted order from the lineitem and partsupp tables using sorted scans over an index. This provides a sorted input into the merge join, which provides sorted input into the aggregation, which in turn provides sorted input into the final merge join.

Interesting Ordering

Interesting ordering refers to avoiding sort operations by keeping track of the ordering of intermediate results that move from operator to operator. SQL Server 7.0 applies this concept to hash joins. Consider the same query, but assume that the crucial index on lineitem has been dropped, so that the previous plan would have to be augmented with an expensive sort operation on the large lineitem table.

[Figure: execution plan in which the hash join and the hash aggregation operate as a hash team]

This query plan employs a hash join instead of a merge join; one of the merge joins is not affected by the dropped index and is therefore still very fast. The two hash operations are marked specially as the root and a member of a team. As data moves from the hash join to the grouping operation, work to partition rows in the hash join is exploited in the grouping operation. This eliminates overflow files for one of the inputs of the grouping operation, and thus reduces I/O costs for the query. The benefit is faster processing for complex queries.

Index Intersections

Microsoft SQL Server 6.5 selects the one best index for each table, even when a query has multiple predicates. SQL Server 7.0 takes advantage of multiple indexes, selecting small subsets of data based on each index, and then performing an intersection of the two subsets (that is, returning only those rows that meet all the criteria). For example, suppose you want to count orders for certain ranges of customers and order dates:

SELECT count (*)
FROM orders
WHERE o_orderdate between '9/15/1992' and '10/15/1992' and
o_custkey between 100 and 200 

SQL Server 7.0 can exploit indexes on both o_custkey and o_orderdate, and then employ a join algorithm to obtain the index intersection between the two subsets. This execution plan exploits two indexes, both on the orders table.
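
A sketch of the two single-column indexes this plan assumes (the index names are hypothetical):

CREATE INDEX idx_orders_orderdate ON orders (o_orderdate)
CREATE INDEX idx_orders_custkey ON orders (o_custkey)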

Index Joins

Index joins are a variation on index intersections. When using any index, if all the columns required for a given query are available in the index itself, it is not necessary to fetch the full row. This is called a covering index because the index covers or contains all the columns needed for the query.

The covering index is a common and well-understood technique. SQL Server 7.0 takes it a step further by applying it to the index intersection. If no single index can cover a query, but multiple indexes together can cover the query, SQL Server considers joining these indexes. The alternative chosen is based on the cost prediction of the query optimizer.
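
For example, assuming the same two single-column indexes on o_orderdate and o_custkey, neither index alone covers this query, but the two joined together supply both columns without fetching any base table rows (a sketch; the optimizer chooses this plan only when its cost estimate favors it):

SELECT o_custkey, o_orderdate
FROM orders
WHERE o_orderdate between '9/15/1992' and '10/15/1992'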

[Figure: execution plan joining two nonclustered indexes to cover the query]

Parallel Queries

Microsoft SQL Server 7.0 introduces intra-query parallelism, which is the ability to break a single query into multiple subtasks and execute them on multiple processors in an SMP computer.

SQL Server accomplishes this by automatically detecting that it is running on an SMP computer and determining the best degree of parallelism for each instance of a parallel query execution. By examining the current system workload and configuration, SQL Server determines the optimal number of threads and spreads the parallel query execution across those threads. When a query starts executing, it uses the same number of threads until completion. SQL Server chooses the optimal number of threads each time a parallel query execution plan is retrieved from the procedure cache. As a result, one execution of a query can use a single thread, and another execution of the same query (at a different time) can use two or more threads.
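
The server's parallelism behavior can be influenced through standard configuration options, as in this sketch (the values shown are examples only):

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'max degree of parallelism', 4 -- 0 means use all available processors
EXEC sp_configure 'cost threshold for parallelism', 5 -- minimum estimated cost before a parallel plan is considered
RECONFIGURE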

In addition to parallel queries, Microsoft SQL Server 7.0 supports parallel backup, parallel restore, and parallel load using multiple clients. For more information, see SQL Server Books Online.

Query Optimization

This section includes information about some of the new query optimization features in SQL Server 7.0. These include:

  • Multiphase optimization 

  • Automatic parameters 

  • Transitive predicates 

  • Nested queries 

  • Moving GROUP BY clauses 

  • Partitioned views 

  • Star queries 

  • Optimized updates 

Multiphase Optimization

The SQL Server 7.0 query optimizer proceeds in multiple phases. First, it looks for a simple but reasonable plan of execution that satisfies the query. If that plan takes less time than a cost threshold value (for example, a fraction of a second), the query optimizer does not bother looking for more efficient plans. This prevents over-optimization, in which the query optimizer uses more resources to determine the best plan than are required to execute the plan.

If the first plan chosen takes more time than the cost threshold value, then the optimizer continues to look at other plans, always choosing the least-cost plan. The use of multiple phases provides a good trade-off between the time it takes to choose the most efficient plan and the time it takes to optimize for that plan.

Automatic Parameters

Most query processors allow you to precompile and store an execution plan—for example, a stored procedure. Precompiling is efficient because it supports reuse of the execution plan and allows users to submit variables as parameters to the plan. A new Open Database Connectivity (ODBC) interface for preparing requests for repeated execution also exploits this efficiency.

Many commercial applications and all ad hoc queries, however, do not use stored procedures. Instead, they use dynamic SQL. SQL Server 7.0 implements a new feature called automatic parameters, which caches the plan created for a dynamic SQL statement and turns its constants into parameters. The result is less compilation effort, providing many of the efficiencies of stored procedures even to applications that do not employ them.
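
For example, these two ad hoc statements differ only in a constant. With automatic parameters, they are likely to share a single cached plan in which the constant has become a parameter (a sketch; exactly which statements are parameterized depends on the query):

SELECT * FROM orders WHERE o_orderkey = 1001
SELECT * FROM orders WHERE o_orderkey = 2002
-- both can reuse a plan of the form: SELECT * FROM orders WHERE o_orderkey = @p1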

SQL Server 7.0 also introduces full support for parameterized queries, in which the application identifies the parameters. This is typical with ODBC, OLE DB, and PREPARE/EXECUTE.

Transitive Predicates

The transitive property of numbers states that if A = B and B = C, then A = C. This property can be applied to queries:

SELECT *
FROM part, partsupp, lineitem
WHERE ps_partkey = l_partkey and l_partkey = p_partkey and
ps_availqty > l_quantity and ps_supplycost > p_retailprice

Because both ps_partkey and p_partkey are equal to l_partkey, ps_partkey must be equal to p_partkey. The query processor takes advantage of this by deriving the third join predicate (ps_partkey equal to p_partkey). For example, in this query, the query processor begins by joining the partkey in the parts table to the partkey in the partsupp table, even though this particular join predicate is never specified in the query. It can do so because of transitive predicates.

[Figure: execution plan using the derived join predicate ps_partkey = p_partkey]

Nested Queries

Correlated subqueries present special challenges for any SQL query processor. SQL Server applies some specific techniques to correlated subqueries, and it flattens them to semi-joins if possible. The advantage of flattening is that all the join algorithms can be applied. For large queries, this means that the query optimizer can consider hash joins or merge joins, rather than using the less efficient nested iteration join.
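
As a sketch, a correlated subquery such as this one can be flattened into a semi-join between customer and orders on c_custkey = o_custkey, after which any of the join algorithms can be applied:

SELECT c_name
FROM customer
WHERE exists
   (SELECT * FROM orders WHERE o_custkey = c_custkey)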

Moving GROUP BY Clauses

The SQL standards require processing a query in the following order:

  1. Execute FROM and WHERE clauses. 

  2. Reduce data using the GROUP BY clause. 

  3. Apply any conditions in the HAVING clause. 

However, any plan that produces the same result is also correct. Therefore, in some queries, you can evaluate the GROUP BY clause earlier, before one or more join operations required for the WHERE clause, thus reducing the join input and the join cost, for example:

SELECT c_name, c_custkey, count (*), sum (l_tax)
FROM customer, orders, lineitem
WHERE c_custkey = o_custkey and o_orderkey = l_orderkey and
o_orderdate between '9/1/1994' and '12/31/1994'
GROUP BY c_name, c_custkey

The query processor looks at the GROUP BY clause and determines that the primary key c_custkey determines c_name, so there is no need to group on c_name in addition to c_custkey. The query optimizer then determines that grouping on c_custkey and o_custkey produces the same result. Because the orders table has a customer key (o_custkey), the query processor can group by customer key as soon as it has the records for the orders table and before it joins to the customer table. This becomes evident in this execution plan.

[Figure: execution plan performing the grouping operation before the join to the customer table]

The query processor first uses a merge join of the orders table (within the specified date range) and the lineitem table to get all order-line items. The second step is a hash aggregation, that is, the grouping operation. In this step, SQL Server aggregates the order-line items at the customer key level, counting them and calculating a sum of l_tax. SQL Server then sorts the output of the hash join and joins it to the customer table to produce the requested result. The advantage of this query plan is that the input into the final join is substantially reduced due to the earlier grouping operation.
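
Conceptually, the transformation resembles this hand-written equivalent, in which a derived table aggregates order-line items by customer key before the join to the customer table (shown for illustration only; the optimizer performs the rewrite internally):

SELECT c_name, c_custkey, g.cnt, g.tax
FROM customer,
   (SELECT o_custkey, count (*) as cnt, sum (l_tax) as tax
   FROM orders, lineitem
   WHERE o_orderkey = l_orderkey and
   o_orderdate between '9/1/1994' and '12/31/1994'
   GROUP BY o_custkey) as g
WHERE c_custkey = g.o_custkey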

Partitioned Views

The SQL Server query processor supports queries that rely on partitioned views. Partitioned views allow you to create multiple tables with constraints (essentially one table for each partition) and have the tables logically reunited in response to queries. Here is an example:

-- illustrative schema; each CHECK constraint scopes one quarter
CREATE TABLE Sales96Q1 (s_date datetime CHECK (s_date between '1/1/1996' and '3/31/1996'), amount money)
CREATE TABLE Sales96Q2 (s_date datetime CHECK (s_date between '4/1/1996' and '6/30/1996'), amount money)
CREATE TABLE Sales96Q3 (s_date datetime CHECK (s_date between '7/1/1996' and '9/30/1996'), amount money)
CREATE TABLE Sales96Q4 (s_date datetime CHECK (s_date between '10/1/1996' and '12/31/1996'), amount money)

CREATE VIEW Sales96 as
SELECT * from Sales96Q1 union all
SELECT * from Sales96Q2 union all
SELECT * from Sales96Q3 union all
SELECT * from Sales96Q4

This data definition language (DDL) creates four tables, one for each quarter of sales, each with an appropriate constraint. The DDL then creates a view that reunites all four tables. Programmers must be aware of the partitioning for updates, but for decision support queries the partitioning is transparent. When the query processor receives a query against the view Sales96, it automatically identifies and removes tables that do not fall within the constraints of the query.

SELECT *
FROM Sales96 -- remember, this view has four tables
WHERE s_date between '6/21/1996' and '9/21/1996' 

If you issue this query, the query processor generates a plan that touches only two of the tables in the view (Sales96Q2 and Sales96Q3), because the WHERE clause makes the other two tables irrelevant to the query. Different access paths can be used for the individual quarters. For example, you can use an index scan for the few days in Q2 (6/21-6/30) and a table scan for Q3. This is a useful method of improving the performance of queries that tend to select subsets of large tables on a well-known column. Time and location are typical examples.

Even when no view is declared, the query processor detects a contradiction between a table's constraints and a query's selection criteria and returns an empty result without reading the data.
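
For example, given the quarterly CHECK constraints shown above, this query contradicts the constraint on Sales96Q1, so the query processor can return an empty result without reading any data:

SELECT *
FROM Sales96Q1
WHERE s_date between '7/1/1996' and '7/31/1996' -- outside Q1; the constraint guarantees an empty result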

Star Queries

Databases designed for decision support, particularly data warehouses and data marts, often have very different table structures than OLTP databases. A common approach is to implement a star schema, a type of database schema designed to allow a user to intuitively navigate information in the database, as well as to provide better performance for large, ad hoc queries.

A star schema begins with the observation that information can be classified into facts, the numeric data that is the core of what is being analyzed, and dimensions, the attributes of facts. Examples of facts include sales, units, budgets, and forecasts. Examples of dimensions include geography, time, product, and sales channel. Users often express their queries by saying "I want to look at these facts by these dimensions," or "I want to look at sales and units sold by quarter."

A star schema takes advantage of this observation by organizing data into a central fact table and surrounding dimension tables.

[Figure: star schema with a central fact table surrounded by dimension tables]

Microsoft SQL Server 7.0 has several techniques for optimizing queries against star schemas. The query processor automatically recognizes these queries and applies any and all of the techniques presented here, as well as combinations of these techniques. It is important to understand that the choice of which technique to apply is entirely cost-based; no hints are required to force these optimizations.

The tables in a star schema do not contain equal numbers of records. Typically, the fact table has many more records. This difference becomes important in many of the query optimization techniques.

A straightforward execution strategy is to read the entire fact table and join it in turn to each of the dimension tables. If no filter conditions are specified in the query, this can be a reasonable strategy. However, when filter conditions are present, the star query optimizations avoid having to read the entire fact table by taking full advantage of indexes.

Cartesian Products and Composite Indexes

Because dimension tables typically have fewer records than the fact table, it can make sense to compute the Cartesian product of the qualifying dimension rows and use the result to look up fact table rows through a multicolumn index.

For example, assume that the sales (fact) table has 10 million rows, the period table has 20 rows, the market table has 5 rows, and the product table has 200 rows. Also assume that a user generates this query using a front-end tool:

SELECT sales.market_id, period.period_id, sum(units), sum(dollars)
FROM sales, period, market
WHERE period.period_id = sales.period_id and
sales.market_id = market.market_id and
period.period_Desc in ('Period2','Period3','Period4','Period5')
and market.market_Desc in ('Market1','Market2')
GROUP BY sales.market_id, period.period_id 

A simple approach is to join the period table to the sales table. Assuming an even distribution of data, the input is 10 million rows, the output is 4/20 (4 periods out of a possible 20), or 2 million rows. This can be done using a hash join or merge join and involves reading the entire 10 million rows in the fact table or retrieving 2 million rows through index lookups, whichever costs less. This partial result is then joined to the reduced market table to produce 800K rows of output, which are finally aggregated.

If there is a multicolumn index on the fact table (for example, on period_id and market_id), the Cartesian product strategy can be used. Because there are 4 rows selected from the period table and 2 rows from the market table, the Cartesian product is 8 rows. These eight combinations of values are used to look up the resulting 800K rows of output. Multicolumn indexes used this way are sometimes called star indexes.
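
A sketch of the composite (star) index this strategy assumes; the eight combinations from the Cartesian product are used as probes into it:

CREATE INDEX idx_sales_star ON sales (period_id, market_id)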

Semi-Join Reductions and Index Intersection

If the joins to the fact table occur on fields other than those contained in a composite index, the query optimizer can use other indexes and reduce the number of rows read from the fact table by performing an intersection of the qualifying rows from joins to each dimension table.

For example, if the query selects on period_id and product_id, the query optimizer cannot use the composite index because the two fields of interest are not a leading subset of the index. However, if there are separate single-column indexes on period_id and product_id, the query optimizer may choose a join between the period table and the sales table (retrieving 2 million index entries) and, separately, a join between the product table and the sales table (retrieving 4 million index entries). In both cases, the optimizer will do the join using an index; thus, the two preliminary joins compute sets of record IDs for the sales table but not full rows of the sales table. Before retrieving the actual rows in the sales table (the most expensive process), an intersection of the two sets is computed to determine the qualifying rows. Only rows that satisfy both joins are included in this intermediate result of 800K rows, and only these rows are actually read from the sales table.

Combined Techniques

Occasionally, semi-join reduction can be combined with Cartesian products and composite indexes. For example, if you select from three dimension tables, where the join columns for two of them form the leading fields of a composite index but the third has a separate single-column index, the query optimizer can use a Cartesian product to satisfy the first two joins, use a semi-join to satisfy the third join, and then combine the results.

Several factors influence the effectiveness of these techniques. These factors include the size of the indexes to be used compared to the base table, and whether the set of indexes used covers all the columns required by the query, obviating the need to look up rows in the base table. These factors are taken into account by the query optimizer, which will choose the least expensive plan.

If that plan is more expensive than the cost threshold for parallelism, all required operations, including scans, joins, intersection, and row fetching, can be executed by multiple threads in parallel.

Optimized Updates

If a row in a table is updated, indexes on the table also must be updated. For small updates, such as OLTP operations, it is appropriate to update the indexes row-by-row as you update each row of the base table.

For large updates, such as a data warehouse refresh, row-by-row updates can be inefficient, resulting in a high volume of random I/O to the index records. A better approach is to delay updating the indexes until all the base table updates are complete, then presort the changes for each index and merge all the changes into that index in a single pass. This ensures that each index leaf page is touched at most once and that SQL Server traverses each B-tree index sequentially.

The query optimizer takes this approach if it is the least expensive approach. The benefit is that large data warehouse refreshes can be accomplished more efficiently.

The query optimizer also plans the join operations required to enforce referential integrity constraints, making a cost-based choice among (index) nested loops join, merge join, and hash join.

Distributed Queries

In addition to storing and searching data locally, Microsoft SQL Server 7.0 can also be used as a gateway to many other data stores, both relational and nonrelational.

SQL Server 7.0 performs distributed queries, that is, queries that involve data from two or more servers. It supports retrievals, updates, and cursors across servers, and ensures transaction semantics across nodes using the Microsoft Distributed Transaction Coordinator (MS DTC). It also maintains security across servers.

If any remote servers support indexes or SQL queries, the SQL Server query optimizer determines the largest possible query that can be sent to each remote server. In other words, the query optimizer assigns the maximum possible data reduction to each remote server. For example, if a remote query is issued against a 1-million row table, with a WHERE clause or an aggregation that returns only 10 records, the 1 million rows are processed at the remote server, and only 10 records are sent across the network. This reduces network traffic and overall query time. Typical operations that are pushed toward the data source are selections, joins, and sorts.
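
A minimal sketch of a distributed query against a linked SQL Server (the server, database, and table names are hypothetical):

EXEC sp_addlinkedserver @server = 'SalesSrv', @srvproduct = 'SQL Server'

SELECT count (*)
FROM SalesSrv.SalesDb.dbo.orders -- four-part name: server.database.owner.table
WHERE o_orderdate > '1/1/1996' -- this filter can be evaluated at the remote server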

Heterogeneous Queries

Distributed queries may be heterogeneous, supporting any OLE DB or ODBC data source. The SQL Server 7.0 compact disc includes OLE DB drivers for Oracle 7.x, Oracle 8.x, Microsoft Excel, Microsoft Access, dBASE, Paradox, and the Microsoft Visual FoxPro database development system, as well as an ODBC gateway for other relational databases. OLE DB providers for other server databases (IBM DB2, SYBASE, and Informix) are available from third parties.

Pass-through Queries

If a remote server supports syntax that is not standard SQL, or if the remote data source supports a query language other than SQL, the OPENQUERY operator is provided to pass through the query syntax unchanged.
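
For example, a query written in a remote server's own dialect can be passed through unchanged (the linked server name and the Oracle-specific syntax are hypothetical):

SELECT *
FROM OPENQUERY(OracleSrv, 'SELECT empno, ename FROM emp CONNECT BY PRIOR empno = mgr')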

Working with Queries

In addition to improved query processing capabilities in the server, Microsoft SQL Server 7.0 also offers these improved tools and features to work with database queries:

  • SQL Server Query Analyzer 

  • Query Governor 

  • SQL Server Profiler 

SQL Server Query Analyzer

SQL Server 7.0 provides SQL Server Query Analyzer, an interactive, graphical tool that allows a database administrator or developer to write queries, execute multiple queries simultaneously, view results, analyze the plan of a query, and receive assistance to improve the performance of a query. The Showplan option graphically displays the data retrieval methods chosen by the SQL Server query optimizer. This is useful for understanding the performance characteristics of a query. In addition, SQL Server Query Analyzer suggests additional indexes and statistics on nonindexed columns that will improve the query optimizer's ability to process a query efficiently. In particular, SQL Server Query Analyzer shows what statistics are missing, thus forcing the query optimizer to guess about predicate selectivity, and permits the creation of those statistics.
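
The same plan information is also available in text form through standard SET options, as in this sketch:

SET SHOWPLAN_TEXT ON
GO
SELECT count (*) FROM orders WHERE o_custkey between 100 and 200
GO
SET SHOWPLAN_TEXT OFF
GO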

Query Governor

Query cost refers to the estimated elapsed time, in seconds, required to execute a query on a specific hardware configuration. On other hardware configurations, there is a correlation between cost units and elapsed time, but cost units do not equal seconds. The query governor lets you specify an upper cost limit for a query; a query that exceeds this limit is not run.

Because it is based on estimated query cost rather than actual elapsed time, the query governor has no run-time overhead. It also stops long-running queries before they start, rather than letting them run until they reach a predefined limit.
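
The limit can be set for a connection or for the whole server, as in this sketch (the value of 300 is an example only):

SET QUERY_GOVERNOR_COST_LIMIT 300 -- this connection: do not run queries estimated above 300 cost units

EXEC sp_configure 'query governor cost limit', 300 -- server-wide default
RECONFIGURE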

SQL Server Profiler

SQL Server Profiler is a graphical tool that allows system administrators to monitor engine events, such as:

  • Login connections, failed logins, and disconnections 

  • Transact-SQL SELECT, INSERT, UPDATE, and DELETE statements 

  • Remote procedure call (RPC) batch status 

  • Start or end of a stored procedure 

  • Start or end of statements within stored procedures 

  • Start or end of an SQL batch 

  • Lock acquired or released on a database object 

  • An opened cursor 

  • Missing statistics that force the query optimizer to guess a predicate's selectivity 

Data about each event can be captured and saved to a file or a SQL Server table for later analysis. Event data can be filtered so that only a relevant subset is collected. For example, only the events that affect a specific database, or those for a particular user, are collected; all others are ignored. Alternatively, data could be collected about only those queries that take longer than 30 seconds to execute.

SQL Server Profiler allows captured event data to be replayed against SQL Server, thereby effectively reexecuting the saved events as they originally occurred. You can troubleshoot problems in SQL Server by capturing all the events that lead up to a problem, and then, by replaying the events on a test system, replicate and isolate the problem.

Index Tuning Wizard

The Index Tuning Wizard is a new and powerful tool that analyzes your workload and recommends an optimal index configuration for your database.

Features of the Index Tuning Wizard include:

  • Comparing, contrasting, and selecting the best mix of indexes using the query optimizer's cost calculations.

  • Recommending the best mix of indexes for a workload (trace file or SQL Script) against a database.

  • Providing index, workload, table-use, and query-cost analysis.

  • Tuning for an entire workload, which may include dropping existing indexes to reduce maintenance costs. 

  • Tuning the database for a set of problem queries without dropping existing indexes.

  • Prototyping index configuration recommendations for different disk space constraints.

The Index Tuning Wizard creates SQL statements that can be used to drop ineffective indexes or to create new, more effective indexes and statistics on nonindexed columns. The SQL statements can be saved for manual execution as necessary.
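
The generated script is ordinary SQL along these lines (the object names are illustrative):

CREATE INDEX idx_orders_cust_date ON orders (o_custkey, o_orderdate)
CREATE STATISTICS stat_lineitem_quantity ON lineitem (l_quantity)
DROP INDEX orders.idx_orders_old -- an existing index judged ineffective for the workload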

For more information, see "Index Tuning Wizard" earlier in this volume.

Automatic Statistics Creation and Refresh

When you create an index, SQL Server 7.0 automatically stores statistical information regarding the distribution of values in the indexed columns. It also supports statistics on nonindexed columns. The query optimizer uses these statistics to estimate the size of intermediate query results as well as the cost of using the index for a query.

If the query optimizer determines that the statistics needed to optimize a query are missing, it automatically creates them and saves them in the database. Moreover, it automatically updates the statistics as the data in a table changes, and it eventually discards the statistics if they are not reused.

Statistics are created and refreshed very efficiently by sampling. The sampling is random across data pages, drawn from the table or from the smallest nonclustered index containing the columns needed by the statistics. The volume of data in the table and the amount of changing data determine the frequency with which the statistical information is updated. For example, the statistics for a table containing 10,000 rows may need updating when 1,000 rows have changed, because 1,000 is a significant percentage of the table. However, for a table containing 10 million rows, 1,000 changes are less significant.
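
Statistics can also be created and refreshed manually with explicit sampling, as in this sketch; the automatic mechanism applies similar sampling internally:

CREATE STATISTICS stat_orders_date ON orders (o_orderdate) WITH SAMPLE 10 PERCENT
UPDATE STATISTICS orders WITH SAMPLE 10 PERCENT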