XML Indexes in SQL Server 2005

 

Bob Beauchemin
SQLskills.com

August 2005

Summary: Use the relational query engine in SQL Server 2005 to make a single query plan for the SQL and XQuery parts of your queries, and make the implementation of XML queries fast and easy to predict and tune. (12 printed pages)

Contents

Introduction
XQuery and the XML Data Type
Types of XML Indexes
How the Indexes Help
XQuery and Schema-Validated Columns
Index and Workload Analysis
Other Tips to Speed Up Your XML Queries
Wrap-up

Introduction

Although not everyone would agree, one of the main reasons for the success of the relational database has been the inclusion of the SQL language. SQL is a set-based declarative language. As opposed to COBOL (or most .NET-based languages for that matter), when you use SQL, you tell the database what data you are looking for, rather than how to obtain that data. The SQL query processor determines the best plan to get the data you want and then retrieves the data for you. As query-processing engines mature, your SQL code will run faster and better (less I/O and CPU) without the developer making changes to the code. What development manager wouldn't be pleased to hear that programs will run faster and more efficiently with no changes to source code because the query engine gets better over time?

One of the ways to allow the query processor the choice of optimized access is to create indexes over the data. Creating the correct index can dramatically change how the query engine evaluates the query. You decide which indexes to create by analyzing which queries you actually perform and figuring up how the engine could optimize those queries. A tool to analyze query workloads and suggest indexes comes with SQL Server. In SQL Server 2005, this tool is Database Tuning Advisor.

In the early days of XML, imperative programming (navigation through the XML DOM) was all the rage. The XQuery language in general and XQuery inside the database in particular make it possible for the query engine writers to approach the task of optimizing queries against XML. The chances of success are good because these folks have 20 years or so of practical experience optimizing SQL queries against the relational data model. The SQL Server 2005 implementation of XQuery over the built-in XML data type holds the same promise of a declarative language, with optimization through a query engine. And the query engine that SQL Server 2005 XQuery uses is the one built-in to SQL Server. SQL Server 2005 XQuery uses the relational engine, with XQuery-specific enhancements. As an example, XQuery mandates that the results be returned in document order, even if you don't use "order by" in the query.

XQuery and the XML Data Type

You use XQuery in SQL Server 2005 through a series of built-in SQL methods on the XML data type. The XML data type is a new native type that can be used as a column in a table, procedure parameters, or as T-SQL variables. The built-in methods may be used with any instance of the XML data type. Table 1 contains a list of the five methods, their signatures, and what they do.

Table 1. XML Data Type Functions

Name Signature Usage
exist bit = X.exist(string xquery) Checks for existence of nodes, returns 1 if any output returned from query, otherwise 0
value scalar = X.value(

string xquery, string SQL type)

Returns a SQL scalar value from a query cast to specified SQL data type
query XML = X.query(string xquery) Returns an XML data type instance from query
nodes X.nodes(string xquery) Table-value function used for XML to relational decomposition. Returns one row for each node that matches the query.
modify X.modify(string xml-dml) A mutator method that changes the XML value in place

Note that each of these methods require XQuery (or XML DML in the case of modify) as a (n)varchar SQL input parameter. Each of these methods is used as a part of an "ordinary" SQL statement as in the following examples.

-- SQL query: return a one-column rowset containing an XML data type
SELECT invoice.query('
    (: XQuery program :)
    declare namespace inv="urn:www-develop-com:invoices";
    declare namespace pmt="urn:www-develop-com:payments";
    for $invitem in //inv:Invoice
    return
      <pmt:Payment>
        <pmt:InvoiceID> {data($invitem/inv:InvoiceID)} </pmt:InvoiceID>
        <pmt:CustomerName> 
           {data($invitem/inv:CustomerName)}
        </pmt:CustomerName>
        <pmt:PayAmt> 
           {data(sum($invitem/inv:LineItems/inv:LineItem/inv:Price))}
        </pmt:PayAmt>
      </pmt:Payment>
  ') AS xmldoc
 FROM xmlinvoice

-- Extract a value from XML data type and use in a SQL predicate
SELECT id
 FROM xmlinvoice
 -- XML.value must return a scalar value (XML singleton or empty sequence)
 WHERE invoice.value('
    (: XQuery program :)
    declare namespace inv="urn:www-develop-com:invoices";
    sum(//inv:Invoice/inv:LineItems/inv:LineItem/inv:Price)
  ',
  -- SQL data type
  'money') > 100

When the query processor evaluates the SQL query, it uses the SQL (relational) query engine. This applies to the XML portion of the query as well as the SQL portion. Because the same query processor is used for the entire query, the query produces a single query plan, as SQL queries always do. And that's where indexes come in. When the XML instance occurs as a column, that column can be indexed. The query processor can use XML indexes to optimize XQuery, just as SQL indexes can be used to optimize SQL queries.

Types of XML Indexes

SQL Server 2005 supports four different types of XML indexes. Since an XML index is somewhat different than a relational index, it is necessary to know their implementation before we approach how to use them for maximum effectiveness. There is a single "primary XML index" and three different flavors of "secondary XML index". And it turns out that the primary XML index isn't strictly an index on the original form of the XML.

The primary XML index on an XML column is a clustered index on an internal table known as the node table that users cannot use directly from their T-SQL statements. The primary XML index is a B+tree and its usefulness is due to the way that the optimizer creates a plan for the entire query. Although the optimizer can operate on the entire XML column as though it is a blob, when you need to execute XML queries, it is more often useful to decompose the XML into relational columns and rows. The primary XML index essentially contains one row for each node in the XML instance. By creating an example primary XML index by executing the following DDL, you can see the columns that the primary XML index contains.

-- create the table
-- the clustering key must be the primary key of the table
-- to enable XML index creation 
CREATE TABLE xmlinvoice (
  invoiceid INT IDENTITY PRIMARY KEY,
  invoice XML
)
GO
-- create the primary XML index
CREATE PRIMARY XML INDEX invoiceidx ON xmlinvoice(invoice)
GO
-- display the columns in the primary XML index (node table)
SELECT * FROM sys.columns c
 JOIN sys.indexes i ON i.object_id = c.object_id
 WHERE i.name = 'invoiceidx'
 AND i.type = 1

Here are the columns that this statement produces. Some terms that I'm using require further explanation later in this article.

Table 2. Columns in the node table

Column Name Column Description Data Type
id node identifier in ordpath format varbinary(900)
nid node name (tokenized) int
tagname tag name nvarchar(4000)
taguri tag uri nvarchar(4000)
tid node data type (tokenized) int
value first portion of the node value sql_variant
lvalue long node value (pointer) nvarchar(max)
lvaluebin long node value in binary (pointer) varbinary(max)
hid path (tokenized) varchar(900)
xsinil is it NULL (xsi:nil) bit
xsitype does it use xsi:type bit
pk1 primary key of the base table int

There are 11 columns in the primary XML index besides the base table’s primary key, which can be a multi-column key; it contains enough data to execute any XQuery. The query processor uses the primary XML index to execute every query except for the case where the entire document will have to be output. In that case it's quicker to retrieve the XML blob itself. Although having the primary XML index is a vast improvement over creating it afresh during each query, the size of the node table is usually around three times that of the XML data type in the base table. The actual size depends upon the XML instances in the XML column—if they contain many tags and small values, more rows are created in the primary XML index and the index size is relatively larger; if there are few tags and large values, then few rows are created in the primary XML index and the index size is closer to the data size. Take this into consideration when planning disk space. This is because the node table contains explicit representations of information (such as the path and node number) that is a different representation of information inherent in the structure of the XML document itself.

The primary XML index is clustered on the primary key of the base table (the pk1 column in the example above) and a node identifier (id). However, it is not a clustered index on the base table xmlinvoice. It is necessary to have a primary key on the base table to create the primary XML index. That primary key is used in a join of the XQuery results with the base table. The XML data type itself cannot be used as a primary key of the base table and so the invoiceid column was included in the base table definition to satisfy the requirement.

The node identifier is represented by a node numbering system that is optimized for operations on the document structure (such as parent-child relationship and the relative order of nodes in the document) and insertion of new nodes. This node numbering system is known as ordpath. Some of the reasons for numbering all the nodes are to maintain document order and structural integrity in the query result. These are not requirements of relational systems, but are requirements in XML. Using the ordpath numbering system makes satisfying these requirements easier for the query engine; the ordpath format contains document order and structure information. See the paper ORDPATHs: Insert-Friendly XML Node Labels by Patrick and Elizabeth O'Neil et al., for all of the details on ordpath.

Once the primary XML index has been created, an additional three kinds of secondary XML index can be created. The secondary XML indexes assist in certain types of XQuery processing. These are called the PATH, PROPERTY, and VALUE indexes. For example, you can create a PATH secondary index using the primary XML index created above like this:

CREATE XML INDEX invpathidx ON xmlinvoices(invoice)
 USING XML INDEX invoiceidx FOR PATH

Secondary XML indexes are actually indexes on the node table. The PATH index, for example in a normal non-clustered index on the (HID, VALUE) columns of the node table. To see the key columns for all the indexes on the node table in index order, you can execute this query:

select i.name as indexname, c.name as colname, ic.* 
   from sys.index_columns ic 
   join sys.columns c  on ic.column_id = c.column_id
                       and ic.object_id = c.object_id
   join sys.indexes i  on ic.object_id = i.object_id
                       and ic.index_id = i.index_id
   where ic.object_id = 
    (select object_id from sys.indexes
     where name = 'invoiceidx' and type = 1)
   order by index_id, key_ordinal

How the Indexes Help

Now that we've seen what the XML indexes consist of in terms of columns and rows, let's see how they are useful with particular kinds of XML queries. We'll begin with a discussion of how SQL Server 2005 actually executes a SQL query that contains an XML data type method.

When an SQL-with-XML query is executed against a table containing an XML data type column, the query must process every XML instance in every row. At the top level, there are two ways that such a query can be executed:

  1. Select the rows in the base table (that is, the relational table that contains the XML data type column) that qualify first, and then process each XML instance using XQuery. This is known as top-down query processing.
  2. Process all XML instances using the XQuery first, and then join the rows that qualify to the base table. This is known as bottom-up query processing.

The SQL Server query optimizer analyzes both the XQuery pieces and relational pieces of the query as a single entity, and creates a single query plan that encompasses and best optimizes the entire SQL statement.

If you've created only the primary XML index it is almost always used in each step of the XQuery portion of the query plan. It is better to use the primary XML index to process the query in almost every case. Without a primary XML index, a table-valued function is used to evaluate the query, as can be seen in this query plan fragment:

ms345121.xmlindexes_01a(en-US,SQL.90).gif

Once the primary XML index is in place, the optimizer chooses which indexes to use. If you have all three secondary indexes, there are actually four choices:

  1. Index scan or seek on the primary XML index
  2. Index scan or seek on node table's PATH index
  3. Index scan or seek on node table's PROPERTY index
  4. Index scan or seek on node table's VALUE index

The primary XML index is clustered (data is stored) in XML document order; this makes it ideal for processing subtrees. Much of the work in XML queries consists of processing subtrees or assembling an answer by using subtrees, so the clustered index on the node table will be frequently used. Here's an example of how the same query plan looks after only the primary XML index is created.

ms345121.xmlindexes_02(en-US,SQL.90).gif

The PATH, PROPERTY, and VALUE index are more special purpose and are meant to help specific queries. We'll continue with examples that use the exist() and query() methods on the XML data type.

The PATH XML index is built on the Path ID (HID) and Value columns of the primary XML index. Because it contains both paths and values, if you need the value (for comparison) by using the path, it is a good "covering" index, as shown in this query:

-- this only uses path index if value index is not available
select * from xmlinvoice 
where invoice.exist('/Invoice/@InvoiceID[. = "1003"]') = 1

You need to have two conditions for the PATH index to be useful. You'll need the path to the node you're using and the path should not contain predicates or wildcards. Knowing both the path and value enables index seeks into the PATH index. The following example uses the PATH to determine which rows contain InvoiceID 1003 and the primary XML index to find the Invoice node serialize its value as output:

select invoice.query('/Invoice[@InvoiceID = "1003"]')
  from xmlinvoice

Changing the query to contain both a predicate and wildcard in the path does not use the PATH index:

select invoice.query('/Invoice[@InvoiceID = "1003"]//LineItem')
  from xmlinvoice

The PROPERTY index contains the primary key of the base table (pk1), Path ID (HID), and Value, in that order. Because it also contains the primary key of the base table, it helps for searching multi-valued properties in the same XML instance. Even though all the Invoice documents have the same specific structure, this is not known to the XQuery processor and therefore every attribute and subelement is considered part of a property bag. We'll see later that typing the XML by using an XML schema lessens the number of unknown property bags the processor has; the structure is known through the schema. In the preceding example, PROPERTY index is used to scan for CustomerName elements under Invoice; CustomerName is considered part of a property bag of subelements. Even when attributes are used in predicates, property CustomerName is useful. In the example below, the PROPERTY index is used to search by Invoice elements anywhere in the document they occur.

select * from xmlinvoice
where invoice.exist('//Invoice/@InvoiceID[. = "1003"]') = 1

In queries like this, the PROPERTY index is preferred over the path index if both are available because the PATH is not very selective. If you change the selectivity of the comparison predicate:

select * from xmlinvoice
where invoice.exist('//Invoice/@InvoiceID[. > "1003"]') = 1

then the PROPERTY index will be used.

The VALUE index contains the same index columns as the PATH index, Value and Path ID (HID), but in the reverse order. Because it contains the value before the path, it’s useful for expressions that contain both path wildcards and values, such as:

-- uses value index if the search value "Mary Weaver" is more selective than the path
select * from xmlinvoice
where invoice.exist('/Invoice/CustomerName/text()[. = "Mary Weaver"]') = 1

-- uses value index due to path wildcard and attribute wildcard
//Invoice/LineItems/LineItem/@*[. = "special"]

Note that, if the preferred type of secondary XML index is not available, an alternate secondary index or the primary XML index may be used. In the example above, if the VALUE secondary index is not available the query processor might decide to use the primary XML index. If the PROPERTY secondary index is not available the processor often uses a two-step process combining PATH and the primary XML index; sometimes a two-step process is used even with the PROPERTY index. Adding another step (i.e., JOIN) to the query plan almost always results in a slower query.

So far, we've only been using the exist() method on the XML data type using a single path and predicate. Things work approximately the same way with the other XML methods. The query method may use node construction in addition to selection. Construction is optimized by using a special tag "Desc" that can be seen in the query plan. Any part of the XQuery that requires selection, however, will use the same (sub) plan as we've been seeing. Bear in mind that any index observations are made with specific sets of data; your results may vary.

XQuery and Schema-Validated Columns

When an XML Schema Collection validates the XML data type column, the order and structure of the documents and the cardinality of each subelement may be known at query compilation time. This allows the query optimizer more chances to optimize the query. We can specify an XML schema for Invoices in a schema collection named invoice_xsd and restrict the XML column to contain only documents (the XML data type can ordinarily contain documents or fragments), and it would look like this:

-- create the table, must have primary key 
CREATE TABLE xmlinvoice2(
  invoiceid INTEGER IDENTITY PRIMARY KEY,
  invoice XML(DOCUMENT invoice_xsd)
)
GO

When we issue the same queries against a schema-valid column, there seem to be three major changes in query plan and index usage.

  1. More bottom-up type queries. Because of the XML schema, the number of nodes that need to be searched for a specific document is known, and sometimes fewer than the number of documents (rows) in the table. When this occurs, a bottom-up query will filter away more of the data.
  2. Greater use of the VALUE secondary index, as opposed to PROPERTY and PATH. Because of the schema, the processor knows that a specific element occurs in only one place in the document, and also that the type of values that the VALUE index is more important and useful and filtering can be done in one step instead of two.
  3. If an element is defined as a numeric or integral data type, scans for a numeric range (e.g., LineItems priced between $20 and $30) can be done more efficiently. No separate step consisting of data type conversion is required.

As an example of the greater usage of VALUE index, the following query changes from a top-down query with a two-step (sub)plan using PROPERTY index and clustered node table index to a bottom-up query with a one-step (sub)plan using the VALUE index.

select *
from xmlinvoice
where invoice.exist('/Invoice/CustomerName[. = "Mary Weaver"]') = 1

The DOCUMENT qualifier is used to infer the cardinality of 1 for the top-level element. DOCUMENT means that the column must contain a document with a single XML root element (no fragments); this is used for data validation and static type inference. However, a predicate expression that starts with //Invoice is optimized differently (uses VALUE index) than one that starts with /Invoice (uses PATH index). The performance of the two will likely be close.

Index and Workload Analysis

Given the fact that the primary XML index is taking up three times the space of the XML content in the data type, if you could choose only one secondary XML index, which one would you choose? It really depends on your workload. The good news is that, because SQL and XQuery are combined to yield a single query plan, ordinary plan analysis, via any of the showplan methods including graphic showplan in SQL Server Enterprise Manager, will work just as well for XML indexes as with relational indexes. You create the index and observe the effect on the query plan. There are a few caveats, however. First, you cannot force index query hints on XML indexes for the purpose of comparing different index strategies for performance. Also, although all four XML indexes on an XML column are used for query optimization and are "ordinary" relational indexes, Database Tuning Advisor does not suggest XML indexes.

When reading a showplan for a SQL/XQuery query, there are a couple of new XQuery specific items to recognize:

  1. Table-Valued Function XML Reader UDF with XPath Filter—this item refers to the on-the-fly creation of a rowset having the node table format (the node table is not actually created) for the XQuery portion of the query. You'll only see this when doing queries on an XML column when no XML indexes exist.
  2. UDX—this item refers to internal operators for XQuery processing. There are five such operators; the name of the operator can be found in the "Name" property if you bring up the Properties window (note: this does not show up in the "hover-over" query step information). The operators are:
    1. Serializer UDX—serializes the query result as XML
    2. TextAdd UDX—evaluates the XQuery string() function
    3. Contains UDX—evaluates the XQuery contains() function
    4. Data UDX—evaluates the XQuery data() function
    5. Check UDX—validates XML being inserted

Other Tips to Speed Up Your XML Queries

Use specific XQuery query styles: You might notice that using the dot (.) in a predicate produced a different (and simpler and faster) query plan than using the attribute name in the predicate. In the examples above, compare the two queries:

select * from xmlinvoice
where invoice.exist('/Invoice/@InvoiceID[. = "1003"]') = 1

and

select * from xmlinvoice
where invoice.exist('/Invoice[@InvoiceID = "1003"]') = 1

Although the result is the same the latter form usually requires one more evaluation step. This is because the query processor is evaluating only one node in the first form (using the PATH index if its present) and is using two evaluation steps (one for /Invoice, one for /Invoice/InvoiceID) in the second form. Although looking at the plan for two "equivalent" queries might seem strange for XML aficionados, SQL query tuners have been doing this for years. Note that the two queries above only produce the same results when using the XML data type exist method, they produce different results when used with the query method.

Avoid wildcards in your queries if possible: Wildcards in a path expression containing elements (e.g., /Invoice//Sku/*) are only useful if you don't know the exact structure of the document, or if the Sku element can occur at different levels of hierarchy. In general, you should structure your document to avoid this, although this is not possible when your data structure uses recursion.

Hoist often searched XML values to relational values: If a given attribute is used frequently in predicates, you can save query-processing time by making this a computed column or redundant column in your relational table. If you always find yourself search on InvoiceID, making it a column allows top-down queries to work more effectively. You might not even have to use the XML instance in the query, if you want the entire document. Refer to Performance Optimizations for the XML Data Type by Shankar Pal et al., for examples of how to do this with both single and multi-valued attributes.

Use full-text search in conjunction with XQuery: To search XML documents or do content-sensitive queries on text, you can combine SQL Server full-text search with XQuery. Full-text search will index the text nodes of XML documents, though not the elements or attributes. You can use the FULLTEXT contains verb to do stem-based or context-based queries over an entire collection of documents (this is most often the top-down part of the query) to select individual documents to operate on, then use XQuery to do structural element and attribute-sensitive structures. Remember that the XQuery contains verb is not at all the same as the FULLTEXT contains verb. XQuery contains is a substring-based function and the SQL Server 2005 implementation uses a binary collation. See XML Support in Microsoft SQL Server 2005 by Shankar Pal et al., for an example of combining fulltext and XQuery.

Wrap-up

I hope you've enjoyed the tour through the XML indexes and other hints to make your XML queries run faster. Remember that, as with any index, excessive use of XML indexes can make insert and modification methods run slower, because the index is maintained along with the raw data. This is especially true of the node table, because the entire document must be shredded during each insert, although modification does not require replacing the entire document. XML indexes should be managed like other indexes with respect to dropping and recreating the indexes in conjunction with bulk loading, index defragmenting, and other database administration techniques.

Using the mature relational query engine to produce a single query plan for both the SQL and XQuery parts of the query should make the SQL Server 2005 implementation of XML queries one of the fastest and easiest to predict and tune. Use this power to your advantage.

 

About the author

Bob Beauchemin is a database-centric application practitioner and architect, instructor, course author, writer, and Director of Developer Skills for SQLskills. Over the past two years he's been teaching his SQL Server 2005 course to premier customers worldwide through the Ascend program. He is lead author of the book "A First Look at SQL Server 2005 For Developers", author of "Essential ADO.NET" and written articles on SQL Server and other databases, ADO.NET, and OLE DB for MSDN, SQL Server Magazine, and others. Bob can be reached at bobb@sqlskills.com.