SQL Server 2000 Windows CE Edition 2.0 Query Processor Overview and Performance Tuning Approaches

Updated: November 4, 2002

Author: Song Xue

Published: October 2002

Applies To: SQL Server 2000 Windows CE Edition 2.0

Summary: This paper provides an overview of the Microsoft SQL Server 2000 Windows CE Edition 2.0 (SQL Server CE) query processor, followed by application and query performance tuning approaches.

This paper assumes that readers have a basic understanding of relational databases, including tables, columns, and indexes, and of Structured Query Language (SQL). All examples in this paper are based on a sample database called Northwind. This database is included in the sample applications that ship with SQL Server CE.

Disclaimer: Some information and techniques covered in this paper apply only to SQL Server 2000 Windows CE Edition 2.0. They might not hold for future versions of SQL Server CE.

On This Page

Overview of SQL Server CE Query Processor
Application and Query Performance Tuning
Going Forward

Overview of SQL Server CE Query Processor

Structured Query Language (SQL) is a declarative language: it specifies what to do, not how to do it. It is the job of a query processor to bridge that gap. Microsoft SQL Server 2000 Windows CE Edition (SQL Server CE) processes queries in the following four steps:

Parsing

The SQL Server CE parser validates the syntax of the query. It extracts the various parts of the query, such as the projection list, the base table list, and the WHERE clause, into a compiler-ready data structure.

Compilation

The SQL Server CE compiler validates that the query is semantically correct. It performs binding, which includes verifying that referenced tables and columns exist and then loading the metadata for those tables and columns. The compiler also verifies that the types of columns and expressions are compatible, and it performs implicit conversions where necessary. (For example, to compile the addition of a float and an integer, the compiler inserts a conversion operator that converts the integer into a float before adding it to the other float.) Finally, the SQL Server CE compiler constructs an initial execution plan. This plan is then revised by the optimizer.
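To make the implicit conversion concrete, consider the following query against the Order Details table (its schema appears later in this paper). Quantity is a SMALLINT and "Unit Price" is MONEY, so the compiler inserts a conversion operator before the multiplication; the "Line Amount" alias is just for illustration.

SELECT "Order ID", "Unit Price" * Quantity AS "Line Amount"
FROM "Order Details"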

Optimization

The optimizer is the heart of the query processor. The SQL Server CE optimizer employs a combination of heuristic and cost-based transformations to optimize queries. Heuristic transformations rewrite the query into a semantically equivalent form that is expected to lead to a better execution plan. Heuristic transformations are based on syntax; they are always applied when applicable.

One of the SQL Server CE rewrite rules merges multiple adjacent inner join operators, so that the cost-based optimizer can consider join orders other than the one declared in the query. For example:

... FROM (Table_1 INNER JOIN Table_2 ON Table_1.Col = Table_2.Col)
INNER JOIN Table_3 ON Table_1.Col = Table_3.Col ...

is rewritten into:

... FROM Table_1, Table_2, Table_3 WHERE Table_1.Col = Table_2.Col AND
Table_1.Col = Table_3.Col ...

This transformation allows the optimizer to consider join orders other than Table_1 joining Table_2, and then joining Table_3.

The application of heuristic transformations is unconditional. Cost-based transformations, on the other hand, work differently: because they might or might not lead to better execution plans, they are evaluated based on their costs.

Cost is the estimated amount of time to run an execution plan. Cost is broken down into two parts: input/output (I/O) cost and CPU cost. A database file consists of pages, such as data pages and index pages. On Windows CE-based devices, the file might be located in RAM, on a compact flash card, on a microdrive, and so on. For SQL Server CE to work on the pages, they need to be brought into the storage engine cache, and if the pages are modified, the changes need to be saved. Such operations incur I/O cost. Examples of operations that incur CPU cost are expression evaluation, data conversion, and sorting. Cost is based on data characteristics collected during compilation, and on the algorithms in the execution plan. For example, the size of a table influences the cost of a table scan on it, and different join methods have different costs because they use different algorithms.

The following are some examples of operations determined by the cost-based optimizer:

  • Base table scan type (file scan vs. index scan)

  • What indexes, if any, are used

  • Join order and join algorithm

  • Sort placement

  • Filter placement

Let's take a closer look at join order. We've already provided an example of a rewrite rule that makes it possible to consider join orders other than the declared one. The optimizer considers alternative join orders and chooses the best order based on cost. For some complex SELECT queries, there are thousands of ways to join the tables together. The SQL Server CE optimizer doesn't attempt to analyze all the possibilities. Instead, it examines a subset and finds a join order with a cost reasonably close to the theoretical minimum. This is important because Windows CE-based devices have lower-power CPUs and less memory than server computers, so the optimization process needs to finish quickly.

The following are the major steps the SQL Server CE optimizer goes through to optimize queries:

  1. Checks the initial execution plan for applicable heuristic transformations and performs those transformations.

  2. Invokes the cost-based optimizer to determine join order, join methods, filter positioning, etc. At the end of this step, the plan with the lowest cost of those considered is chosen.

  3. Generates executable structures that implement the best plan.

Through a variety of approaches, you can help the optimizer select a better execution plan, which results in better performance. Later in this paper we will go through these approaches in detail.

Execution

The SQL Server CE execution engine takes the executable structures generated by the optimizer. It either carries out the actions for data definition language (DDL) and data manipulation language (DML) statements, or, for SELECT queries, returns a cursor on which the application can iterate.

The following is a list of the additional components of the SQL Server CE query processor, and their respective roles in query processing:

  • Type system: Encapsulates a list of supported data types and the behaviors associated with them, such as data conversion and intrinsic operations.

  • Expression service: Compiles and evaluates arbitrary expressions, such as arithmetic, logical, and subquery expressions.

  • Storage layer: Interacts with the SQL Server CE storage engine. The storage layer has two main functions: it queries the storage engine for database schema information that is necessary to compile and optimize queries, and it calls the storage engine to carry out the operations contained in the execution plan.


Figure 1: SQL Server CE Query Processor Architecture Diagram

Application and Query Performance Tuning

There are obvious ways to improve the performance of your application, such as increasing device CPU speed and adding more memory. However, many of the biggest performance gains are the result of changes to application and database design. A good database design is critical to the performance of your application. Performance tuning should start with a review of your database schema.

Consider Database De-normalization

A normalized database eliminates functional dependencies in the data so that updating the database is easy and efficient. However, querying the database might require many joins to combine information. As the number of joined tables increases, the query running time increases substantially. Therefore, a normalized database is not always the best choice. A database with the appropriate amount of de-normalization reduces the number of tables that must be joined, without adding too much complication to the updating process. This is often a good compromise.

In general, if a significant portion of your queries require joins of more than five or six tables, you might want to consider de-normalization. For example, in the Northwind database, which is included in the sample applications that ship with SQL Server CE, we can choose to store the name of the shipper with the order entry in the Orders table, thereby eliminating the need for a separate Shippers table. The result is one fewer table to join with.
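As a sketch, the de-normalized design might add a hypothetical "Shipper Name" column to Orders, populated when the order is created. The before-and-after queries below assume the standard Northwind column names "Ship Via", "Shipper ID", and "Company Name":

-- Add the de-normalized column (hypothetical name)
ALTER TABLE Orders ADD "Shipper Name" NVARCHAR(40)

-- Before: retrieving the shipper name requires a join with Shippers
SELECT O."Order ID", S."Company Name"
   FROM Orders O INNER JOIN Shippers S ON O."Ship Via" = S."Shipper ID"

-- After: one fewer table to join
SELECT "Order ID", "Shipper Name" FROM Orders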

There are other kinds of database de-normalization. Let's illustrate with an example. In the Northwind database, the Orders table contains information about a customer's entire order. The individual products in an order are contained in the Order Details table. Suppose we want to query the total dollar amount of each order. First we need to find the dollar amount for each product (unit price * quantity, less the applicable discount). Then we need to group the amounts by order. Here is what the query looks like:

SELECT "Order ID", SUM("Unit Price" * Quantity * (1.0 - Discount)) AS Total
FROM "Order Details"
GROUP BY "Order ID"

 Order ID               Total
----------------------------------------
               10000                 108
               10001    1363.15000915527
               10002    731.800003051758
               10003    498.180023193359
               10004    3194.19999694824
               10005    173.400009155273
               10006    87.2000007629395
               10007                1405
               10008                1171
               10009                1530
               10010                 470
               ...                   ...
(1078 rows affected)

The amount of calculation for this query is not trivial. For a large set of orders, it can take some time. The alternative is to calculate the dollar amount of the order at the time it is placed, and store that amount in a column in the Orders table. With this approach, all we need to do to answer the original question is to project that column, like this:

SELECT "Order ID", "Order Total" AS Total FROM Orders

The following table shows the performance gain when running this query, instead of the previous one, on the modified Northwind database.

Table 1 Performance gain from database de-normalization

Approach          Running time (seconds)
Original          8.243
Pre-computed      4.483

The tradeoff for this faster approach is that an extra column needs to be maintained when you create and update an order.
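A minimal sketch of that maintenance, assuming the hypothetical "Order Total" column above. When a line item is added to an order, the application adjusts the stored total using the same unit price * quantity * (1 - discount) arithmetic; the values below are illustrative.

-- New line item for order 10000: unit price 24.00, quantity 5, 25% discount
INSERT INTO "Order Details" ("Order ID", "Product ID", "Unit Price", Quantity, Discount)
   VALUES (10000, 11, 24.00, 5, 0.25)

-- Keep the pre-computed total in sync
UPDATE Orders
   SET "Order Total" = "Order Total" + (24.00 * 5 * (1.0 - 0.25))
   WHERE "Order ID" = 10000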

Choose Between Variable and Fixed Length Columns

When you design your tables, it helps to understand the tradeoffs of variable length columns vs. fixed length columns. Variable length columns reduce database size, because they take only what is needed to store the actual value, while fixed length columns always take the maximum space defined by the schema, even when the actual value is empty. The downside of variable length columns is that some operations on them are not as efficient as those on fixed length columns. For example, if a variable length column starts small and an UPDATE causes it to grow substantially, the record might have to be relocated. Furthermore, frequent updates cause data pages to become more fragmented over time. Therefore, it is recommended that you use fixed length columns when data lengths don't vary much and when frequent updates are performed.
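For example, in the hypothetical status table below, the code is always four characters and is updated constantly, so a fixed length NCHAR avoids record relocation, while the long, rarely updated description can stay variable length:

CREATE TABLE DeviceStatus (
   "Device ID" INTEGER PRIMARY KEY,
   "Status Code" NCHAR(4),       -- fixed length: frequent updates happen in place
   "Description" NVARCHAR(255)   -- variable length: saves space, rarely updated
)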

Consider Small Row Length

As a general rule, the more compact the row, the more rows fit on a page. Consequently, a single disk operation on a table with compact rows will retrieve more rows, making the operation more effective. In addition, more rows fit in the storage engine cache, potentially improving the hit ratio. Another disadvantage of large rows is the wasted space on data pages. Consider this extreme example: If the record size is a little bigger than half of a data page, almost half the space on each data page is wasted.

Some database designers opt for a wide table design and port their mainframe database schema down to the device. This might not be an efficient design. One possible approach is to break up the most critical tables. Suppose you have a table in which some columns have very stable values and others change all the time. It makes sense to split the table in two: one table with the frequently referenced columns, the other with the stable columns, as sketched below. This way, you get all the benefits of a smaller row length. The tradeoff is that a join is required to combine the information. You can try both approaches and pick the one that suits your performance goals.
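Here is a minimal sketch of such a split, with hypothetical table and column names. The two tables share the same key, and a join on that key reassembles the full row when needed.

-- Volatile, frequently referenced columns
CREATE TABLE Item_Activity (
   "Item ID" INTEGER PRIMARY KEY,
   "Quantity On Hand" INTEGER,
   "Last Sold" DATETIME
)

-- Stable, rarely referenced columns
CREATE TABLE Item_Info (
   "Item ID" INTEGER PRIMARY KEY,
   "Description" NVARCHAR(255),
   "Manufacturer" NVARCHAR(100)
)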

Consider Small Key Length

An index is essentially an ordered subset of the table it is created on. It permits fast range lookups and provides a sort order. As you'd expect, the smaller the index key, the less space the index takes and the more effective it is. It is particularly good practice to make the primary key compact, because it is frequently referenced as a foreign key in other tables. In the absence of a natural compact primary key, you can use an identity column implemented as an integer instead.
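For example, a table keyed by a long descriptive string can use an integer identity column as its primary key instead. The sketch below uses hypothetical names:

CREATE TABLE Locations (
   "Location ID" INTEGER IDENTITY (1,1) PRIMARY KEY,  -- compact four-byte key
   "Location Name" NVARCHAR(100)                      -- too wide to make a good key
)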

Similarly, you should carefully evaluate your wide indexes. An index with one or only a few key columns is called a narrow index; an index with many key columns is called a wide index. A wide index is often associated with a large key length. An extreme example is an index that includes every column in the table. By creating such an index, you effectively make a duplicate of the original table, which is inefficient, both in terms of database size and query performance.

Create Useful Indexes

Creating useful indexes is one of the most important ways to achieve better query performance. Useful indexes help you find data with fewer disk I/Os and less system resource usage.

The trick to creating useful indexes is to understand how the data is used, the types of queries and how frequently they are run, and how the query processor can use indexes to find your data quickly.

When choosing which indexes to create, look at your critical queries, whose performance affects the user experience most, and create indexes to specifically aid those queries. After adding an index, rerun the query to see whether performance improves; if it does not, remove the index. Later in this paper, we'll enumerate some scenarios in which indexes are used to aid query evaluation.
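For example, if a critical query filters Orders on "Order Date", you might try an index like the one below, measure the query, and drop the index if it does not help. The index name is illustrative:

CREATE INDEX Idx_Orders_OrderDate ON Orders ("Order Date" ASC)

-- If rerunning the query shows no improvement:
DROP INDEX Orders.Idx_Orders_OrderDate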

As usual, there's a flip side. With more indexes, your SELECT queries will potentially run faster. However, DML (INSERT/UPDATE/DELETE) operations will significantly slow down as more indexes need to be maintained. Therefore, if your queries are mostly SELECT statements, more indexes can be helpful. (Remember to keep an eye on the size of your database: The more indexes, the bigger the database.) If your application performs many DML operations, you should be conservative with the number of indexes you create.

The concept of useful indexes is explained in detail in the next few sections.

Create Highly Selective Indexes

Indexing columns used in the WHERE clauses of your critical queries often improves performance. However, this depends on how selective the index is likely to be. Consider selectivity as the ratio of qualifying rows to total rows. If the ratio is low, the index is highly selective: it can eliminate the majority of the rows and drastically reduce the size of the result set. It is therefore considered useful. By contrast, an index that is not selective is not considered useful.

A unique index has the greatest selectivity. Only one row can match, which makes it most helpful for queries that intend to return exactly one row. For example, an index on Orders."Order ID" will help you find a particular order quickly.
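If "Order ID" were not already constrained to be unique, such an index could be created explicitly. The index name is illustrative:

CREATE UNIQUE INDEX Idx_Orders_OrderID ON Orders ("Order ID" ASC)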

You can evaluate the selectivity of an index by running a query that shows the data characteristics of its key columns. To compare an index on Orders."Customer ID" and another on Orders."Ship Via", you can run the following two queries:

SELECT "Customer ID", COUNT("Customer ID") AS "Count"
FROM Orders GROUP BY "Customer ID"

         Customer ID               Count
----------------------------------------
               ALFKI                   7
               ANATR                   4
               ANTON                  13
               AROUT                  14
               BERGS                  23
               BLAUS                   8
               BLONP                  14
               BOLID                   7
               BONAP                  19
               BOTTM                  20
               BSBEV                  12
               CACTU                   6
               CENTC                   3
               CHOPS                  12
               COMMI                   5
               CONSH                   4
               DRACD                   9
               DUMON                   8
               EASTC                  13
               ERNSH                  33
               ...                   ...
(90 rows affected)

SELECT "Ship Via", COUNT("Ship Via") AS "Count"
FROM Orders GROUP BY "Ship Via"

            Ship Via               Count
----------------------------------------
                   1                 320
                   2                 425
                   3                 333
(3 rows affected)
Orders."Customer ID" has a low degree of duplication, which makes the index on it more selective. On the other hand, the index on Orders."Ship Via" is not selective, due to the high degree of duplication in the key column.

Create Multiple-Column Indexes

Multiple-column indexes are natural extensions of single-column indexes. They follow the same rules, such as those regarding selectivity, and they have the same applications as well. For example, they are useful for evaluating filter expressions that match a prefix set of the key columns. The following composite index on employee last name and first name,

CREATE INDEX Idx_Emp_Name ON Employees ("Last Name" ASC, "First Name" ASC)

will help evaluate the following query:

... WHERE "Last Name" = 'Doe'

It will also help evaluate this query:

... WHERE "Last Name" = 'Doe' AND "First Name" = 'John'

However, it will not be useful for this one:

... WHERE "First Name" = 'John'

In addition, the order in which the columns actually appear in the WHERE clause does not matter. The same index can be used to evaluate the following query.

... WHERE "First Name" = 'John' AND "Last Name" = 'Doe'

Note: SQL Server CE uses a double quotation mark (") as the delimiter for database objects such as tables, columns, and indexes. It uses a single quote (') for string literals.

When creating a multiple-column index, you should put the most selective columns leftmost in the key. This makes the index more selective when matching various expressions.
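For example, given the data characteristics shown earlier (90 distinct "Customer ID" values versus 3 distinct "Ship Via" values in Orders), a composite index over those two columns should lead with "Customer ID". The index name is illustrative:

CREATE INDEX Idx_Orders_Cust_Ship ON Orders ("Customer ID" ASC, "Ship Via" ASC)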

Avoid Indexing Very Small Tables

A small table is one whose contents fit in one data page or just a few data pages. SQL Server CE stores data in 4K pages. The page count can be roughly calculated using the following formula; the actual count can be slightly larger, due to storage engine overhead.

                <sum of column sizes in bytes> * <table cardinality>
<# of pages> = ------------------------------------------------------
                                      4096

For example, the Order Details table has the following schema and 2820 rows.

Table 2 Schema of Order Details table

Column Name     Type
Order ID        INTEGER
Product ID      INTEGER
Unit Price      MONEY
Quantity        SMALLINT
Discount        REAL

According to the formula, it takes about 16 pages to store its data:

<# of pages> = ((4 + 4 + 8 + 2 + 4) * 2820) / 4096 = 15.15

You should avoid indexing very small tables, because it is usually more efficient to simply scan the table. This saves the cost of loading and processing index pages. Not creating an index on a very small table also eliminates the possibility of the optimizer choosing one.

Index Primary Keys, Foreign Keys, and Other Join Columns

It is always useful to create indexes on primary keys. It is often useful to do so on foreign keys. This is because primary keys and foreign keys are frequently used to join tables. Indexes on them allow more efficient index join algorithms to be considered by the optimizer. If your query joins tables using other columns, it is often helpful to create indexes on them for the same reason.

When primary key and foreign key constraints are created, SQL Server CE automatically creates indexes for them and takes advantage of them when optimizing queries. Remember to keep primary keys and foreign keys small. Joins run faster this way.
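For example, declaring the constraints when a table is created gives you the supporting indexes automatically. The following sketch uses a hypothetical, simplified order table:

CREATE TABLE SimpleOrders (
   "Order ID" INTEGER CONSTRAINT PK_SimpleOrders PRIMARY KEY,  -- index created automatically
   "Customer ID" NCHAR(5),
   CONSTRAINT FK_SimpleOrders_Customers FOREIGN KEY ("Customer ID")
      REFERENCES Customers ("Customer ID")                     -- index created automatically
)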

As a demonstration, indexes on Customers."Customer ID" and Orders."Customer ID" are helpful for the following query:

SELECT Customers."Customer ID", Orders."Order Date"
   FROM Customers INNER JOIN Orders
   ON Customers."Customer ID" = Orders."Customer ID"
   ORDER BY Customers."Customer ID"

Note: The query above is equivalent to the one below. The query processor ignores the order in which the input tables are declared in both SQL statements; it picks the join order with the lowest cost.

SELECT Customers."Customer ID", Orders."Order Date"
   FROM Customers, Orders
   WHERE Customers."Customer ID" = Orders."Customer ID"
   ORDER BY Customers."Customer ID"

Write Optimizable / Sargable Clauses

Indexes can be used to speed up evaluations of certain types of filter clauses. Therefore it is advantageous to phrase your filters using those types of clauses and create indexes that help their evaluations.

While all filter clauses reduce the final result set of a query, some can also help reduce the amount of data that needs to be scanned. In other words, they are useful as search arguments (SARGs) or as part of the join criteria, and they can make use of an index for faster retrieval. Such clauses are called optimizable, or sargable.

A SARG limits a search because it specifies an exact match, a range of values, or a conjunction of two or more items joined by AND. It has one of the following forms:

  • Column operator <constant or variable>

-Or-

  • <constant or variable> operator Column

Sargable operators include =, >, <, >=, <=, IN, BETWEEN, and sometimes LIKE (in cases of prefix matching, such as LIKE 'John%'). A SARG can include multiple conditions joined with AND. SARGs can match a specific value, such as:

  • "Customer ID" = 'ANTON'

  • 'Doe' = "Last Name"

or queries that match a range of values, such as:

  • "Order Date" > '1/1/2002'

  • "Customer ID" > 'ABCDE' AND "Customer ID" < 'EDCBA'

  • "Customer ID" IN ('ANTON', 'AROUT')

An expression that is not sargable cannot limit the search; the SQL Server CE query processor has to evaluate every row to determine whether it meets the filter clause. An index is therefore not useful for non-sargable expressions. Examples of non-sargable operators are NOT, <>, NOT EXISTS, NOT IN, NOT LIKE, and intrinsic functions.

These are some non-sargable examples:

  • UPPER("Customer ID") = 'ANTON'

  • "Last Name" LIKE '%Doe%'

Certain non-sargable filter clauses can be rewritten into sargable ones. For example, because Northwind customer IDs are stored in uppercase (as the earlier output shows), the clause:

  • UPPER("Customer ID") = 'ANTON'

can be rewritten into:

  • "Customer ID" = 'ANTON'

Understand How the Optimizer Selects an Index

When determining the access method for a base table, the SQL Server CE optimizer checks whether an index exists for a sargable clause. If so, it evaluates the effectiveness of the index by calculating its selectivity (that is, how many rows will be returned), and then estimates the cost of finding the qualifying rows using the index. It chooses indexed access if that costs less than a table scan. An index is potentially useful if its first column or a prefix set of its columns is used in the SARG, and the SARG establishes a lower bound, an upper bound, or both to limit the search. All the previous examples of sargable clauses can potentially be implemented with properly created indexes.

When writing queries, try to use sargable clauses that will take advantage of existing indexes. If the optimal index doesn't exist, re-evaluate the database schema according to your performance goals, and find out if it makes sense to add it.

Understand Response Time vs. Total Time

Response time is the time it takes for the query to return the first record. Total time is the time it takes for the query to return all records. For an interactive application, response time is important, because it is the perceived time for the user to receive visual confirmation that a query is being processed. For a batch application, total time is what affects overall throughput. You need to determine what the performance criteria are for your application and queries, and then design accordingly.

For example, suppose a query returns 100 records and is used to populate a five-line list box with the first five of them. We probably are not concerned with how long it takes to return all 100 records; instead, we want the query to return the first few records quickly, so we can perform the initial population of the list box. On the other hand, if we need to process all 100 records, we are probably more concerned with the total running time than with how fast the first record is returned.

Many query operations can be performed without having to store intermediate results. These operations are said to be pipelined. Examples of pipelined operations are projections, selections, and joins. Queries implemented with these operations can return results right away. Other operations, such as sort and group-by, require consuming all their input before returning results to their parent operations (an exception is when an index is used to satisfy the sort and group-by requirement). These operations are said to require materialization. Queries implemented with these operations tend to have an initial delay due to materialization. After this delay is over, however, they tend to return records very quickly.

Queries with response time requirements should avoid materialization. For example, using an index to implement order-by instead of sorting yields better response time. The following section describes this in detail.

Index the ORDER-BY / GROUP-BY / DISTINCT Columns for Better Response Time

The order-by, group-by, and distinct operations all share the requirement of sorting. For example, to evaluate a group-by, the input relation needs to be sorted on the group-by expressions. The distinct operation is a special kind of group-by in which all output expressions are group-by expressions. The SQL Server CE query processor implements sort in two ways: if the records are already sorted by an index, it only needs to walk the index; otherwise, it has to use a temporary work table to sort the records manually. Such manual sorting is a costly operation on devices with lower-power CPUs and limited memory, and it results in a substantial initial delay, so it should be avoided when response time is important.

For example, an index on Customers."Customer ID" can potentially improve the response time of the first query below, and an index on Orders."Customer ID" can do the same for the second. Without these indexes, an initial delay occurs.

SELECT "Customer ID"
   FROM Customers
   ORDER BY "Customer ID"

SELECT "Customer ID", COUNT("Customer ID")
   FROM Orders
   GROUP BY "Customer ID"

If an index can benefit both the sargable clause and the order-by clause, it is used to optimize both. For example:

SELECT "Customer ID"
   FROM Customers
   WHERE "Customer ID" LIKE 'F%'
   ORDER BY "Customer ID"

In the context of multiple-column indexes, for order-by / group-by to consider a particular index, the order-by / group-by columns must match a prefix set of the index columns, in the exact order. Consider this index:

CREATE INDEX Emp_Name ON Employees ("Last Name" ASC, "First Name" ASC)

This index can help optimize:

  • ... ORDER BY / GROUP BY "Last Name" ...

  • ... ORDER BY / GROUP BY "Last Name", "First Name" ...

It cannot optimize:

  • ... ORDER BY / GROUP BY "First Name" ...

  • ... ORDER BY / GROUP BY "First Name", "Last Name" ...

In order for distinct operation to consider a multiple-column index, the projection list must match all index columns, though they don't have to be in the exact order. The previous index can help optimize:

  • ... DISTINCT "Last Name", "First Name" ...

  • ... DISTINCT "First Name", "Last Name" ...

It cannot optimize:

  • ... DISTINCT "First Name" ...

  • ... DISTINCT "Last Name" ...

The following chart shows a performance comparison of order-by implemented with an index versus a sort, on tables of various sizes. The actual queries run are the following:

  • SELECT "Customer ID" FROM Customers ORDER BY "Customer ID"

  • SELECT "Order ID" FROM Orders ORDER BY "Order ID"

  • SELECT "Order ID" FROM "Order Details" ORDER BY "Order ID"


Figure 2: Performance comparison of ORDER-BY using index and sort

You might notice that in all three cases the total time is better with an index than with a sort. This is one area where query processing on Windows CE-based devices differs from that on server computers, because of the limited CPU power and available RAM. With SQL Server CE, an index scan is often faster than a sorted table scan. You can try both approaches and find out which is best for your application.

Additionally, if your query always returns unique rows on its own, avoid specifying the DISTINCT keyword, because it only adds overhead.

Rewrite Subquery To Use Join

In some cases you can rewrite a subquery to use a join and achieve better performance. Ideally the query processor considers both approaches during optimization (because they are semantically equivalent) and picks the better one. In actuality, however, it does so for some subqueries but not for others.

Neither the subquery approach nor the join approach is always better than the other. The advantage of the join approach is that the tables can be evaluated in an order different from the one defined by the query. The advantage of the subquery approach is that it is often unnecessary to scan all rows of the subquery to evaluate the subquery expression. For example, an EXISTS subquery can return TRUE upon seeing the first qualifying row.

The SQL Server CE query processor always rewrites the IN subquery using a join, so there is no need to try this approach with queries containing an IN subquery clause.

Here is an illustration of the rewrite. To find all the orders that have at least one item with a discount of 25 percent or more, you can use the following EXISTS subquery:

SELECT "Order ID" FROM Orders O
   WHERE EXISTS (SELECT "Order ID"
                   FROM "Order Details" OD
                   WHERE O."Order ID" = OD."Order ID" 
                        AND Discount >= 0.25)

You can also rewrite this using a join:

SELECT DISTINCT O."Order ID" FROM Orders O
   INNER JOIN "Order Details" OD ON O."Order ID" = OD."Order ID"
   WHERE Discount >= 0.25

There is a substantial difference in total time between the two approaches, as shown in the following chart.


Figure 3: Performance comparison of subquery vs. join

Place Filters in a WHERE Clause Instead of a HAVING Clause

The filters discussed here reference the group-by columns only and do not contain aggregate expressions. They can be placed either in a HAVING clause or in a WHERE clause. Because the SQL Server CE query processor evaluates WHERE clauses earlier than HAVING clauses, it is better to place them in a WHERE clause. That way they are applied earlier, reducing the size of the intermediate result and leading to better performance. Of course, filters that contain aggregate expressions must remain in the HAVING clause.

For example, to find out the customers whose IDs start with 'A' and the number of orders they have placed, we could phrase the query this way:

SELECT "Customer ID", COUNT("Customer ID")
FROM Orders GROUP BY "Customer ID"
HAVING "Customer ID" > 'A' AND "Customer ID" < 'B'

          Customer ID                  #1
----------------------------------------
               ALFKI                   7
               ANATR                   4
               ANTON                  13
               AROUT                  14
(4 rows affected)

Here is an equivalent query, with the filter expressions moved to the WHERE clause:

SELECT "Customer ID", COUNT("Customer ID")
FROM Orders WHERE "Customer ID" > 'A' AND "Customer ID" < 'B'
GROUP BY "Customer ID"

The following chart shows the performance comparison of the two queries. In this case, there is a substantial advantage to placing filter expressions in a WHERE clause.


Figure 4: Performance comparison of filter in HAVING clause vs. WHERE clause

Limit the Usage of Outer Joins

Outer joins are treated differently from inner joins in that the optimizer does not attempt to rearrange the join order of outer join tables as it does to inner join tables. The outer table (the left table in LEFT OUTER JOIN and the right table in RIGHT OUTER JOIN) is accessed first, followed by the inner table. This fixed join order can lead to execution plans that are less than optimal.

We have seen cases where inner join semantics are intended but outer join syntax is actually used. Because the two often produce the same results, this kind of misuse often goes unnoticed. The reality is that it can lead to poor performance.

Sometimes you can avoid the use of outer joins by making small changes to the database design. For example, suppose that in the Northwind database you have a product that doesn't belong to any category. One way to express this is to insert NULL as the "Category ID" for that product in the Products table. Subsequent retrieval of products together with category information will then require an outer join, as shown below.

SELECT P."Product Name", C."Category Name" FROM Products P 
LEFT OUTER JOIN Categories C ON P."Category ID" = C."Category ID"

Another way to accomplish this is to create an additional category entry named "NoCategory" or "Miscellaneous" and assign that category's "Category ID" to the product instead. Because every product now has a category associated with it, the outer join is no longer necessary, as shown below.

SELECT P."Product Name", C."Category Name" FROM Products P 
INNER JOIN Categories C ON P."Category ID" = C."Category ID"
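A sketch of that change follows. The new category's ID value is illustrative; if "Category ID" is an identity column, let the engine generate the value instead.

-- Create the placeholder category once
INSERT INTO Categories ("Category ID", "Category Name")
   VALUES (9, 'Miscellaneous')

-- Move uncategorized products to it
UPDATE Products SET "Category ID" = 9 WHERE "Category ID" IS NULL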

Consider Using Parameterized Queries

If your application runs a series of queries that differ only in some constants, you can use a parameterized query. For example, to look up the customer who placed each of a series of orders, you can run the following query:

SELECT "Customer ID" FROM Orders WHERE "Order ID" = ?

Parameterized queries yield better performance by compiling the query only once and executing the compiled plan multiple times. Remember that, programmatically, you need to hold on to the command object that contains the cached query plan; destroying the command object and creating another one wipes out the cached plan and makes recompilation necessary. If you need to run several parameterized queries in an interleaved fashion, you can create a few command objects, each caching the execution plan for one parameterized query. This way, you effectively avoid recompilation for all of them.

The following chart shows a performance comparison of the non-parameterized and parameterized approaches to the query shown above. Keep in mind that the simpler the query, the more improvement you will see from parameterization, because relatively more overhead is eliminated by not recompiling.


Figure 5: Performance comparison of non-parameterized vs. parameterized queries

Decide Whether to Query or Not to Query

The SQL Server CE query processor is a powerful tool for querying data stored in your relational database. However, the power comes at a cost: any query processor must compile, optimize, and generate an execution plan before it starts the real work of carrying out the plan. This overhead matters most for simple queries that don't take long to finish, so implementing such queries yourself can sometimes provide a substantial performance improvement. If every millisecond counts in your critical component, consider implementing the simple queries yourself. For large and complicated queries, the job is still best left to the query processor.

For example, suppose you need to look up the customer ID for a series of orders by their order IDs. There are a couple of ways to accomplish this. You could perform the following steps for each lookup:

  1. Open the Orders base table.

  2. Find the row, using the given "Order ID".

  3. Retrieve the "Customer ID".

Or you could issue the following query for each lookup:

SELECT "Customer ID" FROM Orders WHERE "Order ID" = <the given order id>

The following chart compares the performance of the two approaches.


Figure 6: Performance comparison of base table seek vs. SELECT query

As you might expect, the query-based solution is simpler but slower. The SQL Server CE query processor translates the declarative SQL statement into the same three operations you would implement manually, and then carries them out step by step. Your choice should depend on whether simplicity or performance is more important for your application.

Miscellaneous

It is good practice to minimize the size of the result set returned: avoid "SELECT *", and always spell out the projection list, as shown below.
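For example, if only the order dates are needed:

-- Instead of: SELECT * FROM Orders
SELECT "Order ID", "Order Date" FROM Orders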

As the number of join tables increases, the query running time increases substantially. You can try the database de-normalization approach discussed earlier to reduce query complexity. You can also break the original query into a few smaller ones, and perhaps use temporary work tables to store intermediate results.

Only one index can be used per base table per query. You can index join columns to speed up joins, and you can index filter columns to speed up filter evaluation. Which set of columns should you index? The general approach is to index both sets, as shown below; the optimizer will choose one intelligently, based on the data characteristics and the actual query.
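For example, for the earlier join between Orders and "Order Details" with a filter on Discount, you could index both the join column and the filter column and let the optimizer pick. The index names are illustrative:

CREATE INDEX Idx_OD_OrderID ON "Order Details" ("Order ID" ASC)   -- join column
CREATE INDEX Idx_OD_Discount ON "Order Details" (Discount ASC)    -- filter column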

With SQL Server CE 2.0, you can specify the location of the temporary database. By default, it is the same as that of the working database. Be aware of the performance consequences of moving the temporary database from RAM to a compact flash card or microdrive: because reading and writing data is slower on those media, query performance might suffer.

It is recommended that you compact the database periodically. The compact operation rebuilds the database: it eliminates unused space ("pockets") in data and index pages, reorders records in primary key order, groups similar pages together, and refreshes the statistics associated with the database. All of these potentially contribute to better performance.

Going Forward

SQL Server CE provides a powerful yet compact database engine that serves a wide variety of enterprise and personal information management applications. Because the mobile RDBMS market is emerging, and the needs of applications can change significantly and quickly, the SQL Server CE query processor is designed to be easily extended in a variety of ways. In the future, we plan to add new SQL features, new optimization and execution algorithms, and new customization capabilities. This extensible architecture will help subsequent releases tailor SQL Server CE to the varying demands of handheld and embedded devices.

The following diagram illustrates the schema of the Northwind database, which is included with the sample applications that ship with SQL Server CE.


Figure 7: Northwind sample database schema