SQL Server: XML To The Max: Get More Power Out Of Your SQL Server

Article
09/08/2016

SQL Server

XML To The Max: Get More Power Out Of Your SQL Server

Ward Pond

Prerelease info in this article is subject to change.

At a Glance:

How SQL Server 2000 stores XML data
Getting XML into and out of SQL Server
New T-SQL syntax for handling XML data

SQL Server 2000

SQL Server 2005

T-SQL

XML

SQL Server 2000 is the first release of the Microsoft flagship database management system to include support for the production and manipulation of XML structures using T-SQL.

In this article, I will present a brief overview of how you can now manipulate XML in SQL Server™ 2000. The focus of the article is intra-system communication between a SQL Server 2000 T-SQL stored procedure and a Web service or user interface component.

I'll discuss how to access data contained in an XML string passed into a stored procedure and how to produce XML output. Figure 1 provides a high-level overview of these constructs, and I'll introduce the following XML-specific T-SQL syntax:

sp_xml_preparedocument
OPENXML
sp_xml_removedocument
FOR XML EXPLICIT

Figure 1 Some New T-SQL Syntax in Action

Figure 1

A Sample Case: Batch Validation of User Input

In SQL Server 2000, native storage of XML data is not supported. Rather, an XML string is stored in a (n)(var)char, (n)text memory variable, or table column. In almost all development scenarios, this memory variable will be an input parameter to a stored procedure; therefore, I'll take that same approach here. (You should note that passing T-SQL between tiers is always risky from a security standpoint and generally ill-advised from a performance perspective, two excellent reasons to avoid the practice altogether.)

Let's take a look at a stored procedure my team built several years ago called sptxValidateLookupData. The purpose of this stored procedure is to validate user inputs to a particular system's workflow. In order to minimize database round-trips, the designers of this system elected to collect all the user input and send that which requires validation to the database in a single XML document. The stored procedure runs the validations and returns the results to the calling process, also in a single XML document. The sptxValidateLookupData procedure serves a number of diverse workflows, so both existence and non-existence tests may be requested for any datum or domain in a particular batch. Typical XML input to this stored procedure might appear as shown in the following code:

<ValidateData>
<Validate Type="Countries" Name="CountryCode"   
  Value="JA" Test="Not Exists"/>
<Validate Type="Countries" Name="CountryCode" 
  Value="BO1" Test="Exists"/>
<Validate Type="Languages" Name="LanguageCode" 
  Value="EN" Test="Exists"/>
<Validate Type="ContactPreferences" 
  Name="ContactPreferenceCode" Value="EN" 
  Test="Exists"/>
</ValidateData>

Under the <ValidateData> root, each <Validate> node describes a validation to be performed. Type describes the domain to be tested, Name describes the attribute to be tested, Value contains the value to be tested, and Test indicates whether to check for existence or non-existence of Value in the Name column of the Type domain. Note that while the sample XML contains four validations against 3 domains, the version of this stored procedure currently in production supports 17 domains and an arbitrary number of tests against them—a highly reusable, highly extensible, highly performing, easily maintained, and, best of all, dirt-simple validation component!

Results are returned in XML as well. If all of the tests in a particular batch evaluate to TRUE (if they turn out as the Test node suggests), an empty <Results/> tag is returned. If, however, any of the tests fail, a list of errors is returned:

<Errors>
<Error ErrorMessage="JA exists in Countries" 
  FieldName="CountryCode"/>
<Error ErrorMessage="BO1 does not exist in 
  Countries" FieldName="CountryCode"/>
<Error ErrorMessage="EN does not exist in 
  ContactPreferences" 
  FieldName="ContactPreferenceCode"/>
</Errors>

Using XML

SQL Server Books Online References

The following topics will get you started using XML in SQL Server 2005. Additional information can always be found on the TechNet Web site and at SQL Server Home.

Contents/What's New/XML Integration of Relational Data
Contents/XML and Internet Support (particularly, Retrieving and Writing XML Data)
sp_xml_preparedocument, sp_xml_removedocument
OPENXML
FOR XML EXPLICIT, FOR XML AUTO
Using EXPLICIT Mode

In SQL Server 2000, XML is passed into a T-SQL process as a simple string. Before it can be manipulated relationally or hierarchically, it must be identified to SQL Server as XML or "prepared." This is accomplished via a call to the system stored procedure sp_xml_preparedocument. Let's start at the top of sptxValidateLookupData (see Figure 2). The sp_xml_ preparedocument system stored procedure reads XML text provided as input (the @XMLString parameter in the call in Figure 2), then parses the text using the MSXML parser and provides the parsed document in a state ready for consumption by the OPENXML rowset function.

Figure 2 sptxValidateLookupData

create procedure sptxValidateLookupData 
@XMLString ntext
as

set nocount on

declare @idoc int, @Name nvarchar(30), @Value nvarchar(300), @Test nvarchar(30),
        @Result int, @Type nvarchar(30), @TestResult int

--    prepare the input XML for OPENXML manipulation
exec sp_xml_preparedocument @idoc OUTPUT, @XMLString

This parsed document is a hierarchical representation of the various nodes (elements, attributes, text, comments, and so on) in the XML document. This internal representation is stored in the server's cache, which has a maximum size of one-eighth total server memory and therefore needs to be managed carefully in heavily utilized or memory-constrained environments.

The sp_xml_preparedocument procedure returns a handle (the @idoc value in the call in the Figure 2) that can be used to access the newly created internal representation of the XML document. This value is used as a parameter to the OPENXML rowset function of the SELECT statement. OPENXML is your gateway to the prepared XML, and may be used within a SELECT statement the same way you'd use a table or a view (see Figure 3).

Figure 3 Create Table Variable

--    create a table variable to drive the tests
declare @tempValidateLookupData table (
    [Type] nvarchar(30),
    [Name] nvarchar(30),
    [Value] nvarchar(300),
    [Test] nvarchar(30),
    [TestResult] int
)

--    populate the table variable with the tests to be performed
insert @tempValidateLookupData
select [Type], [Name], [Value], [Test], NULL
    from OPENXML (@idoc, '/ValidateData/Validate')
    with ([Type] nvarchar(30), [Name] nvarchar(30), [Value] nvarchar(300), [Test] nvarchar(30))

I created and populated the table variable in order to avoid multiple OPENXML calls that would be less efficient when compared with accessing the table variable multiple times. It's a best practice to copy the contents of the XML to a table variable to avoid repeated, inefficient access to the XML (if you're going to access the XML only once, though, just run the OPENXML query without creating the table variable first). The second statement in Figure 3 copies the input XML into the @tempValidateLookupData table variable.

Opening Up OPENXML

Now let's take a closer look at the OPENXML syntax shown here:

OPENXML(idoc int [in],rowpattern nvarchar[in],[flags byte[in]]) 
[WITH (SchemaDeclaration | TableName)]

The idoc parameter is the document handle of the internal representation of the XML document—in other words, the value passed back from an earlier call to sp_xml_preparedocument. Note that it's possible to manipulate multiple XML strings in the same stored procedure. In such a case, a call to sp_xml_preparedocument must be made for each XML string, and a separate memory variable must be declared for each returned handle.

The rowpattern parameter is used to identify the nodes in the XML document associated with the idoc handle to be processed as rows. This tells the XML parser where in the XML document it can find the data you want.

The flags parameter describes precedence between attribute-centric and element-centric mapping for the query. SQL Server 2000 defaults to attribute-centric mapping when this parameter is omitted, which is sufficient for the purposes of this discussion. Further information on this is available in the SQL Server Books Online topic, OPENXML.

The WITH section of the OPENXML rowset function tells SQL Server 2000 how to map the contents of the XML into SQL Server 2000 datatypes. You can either explicitly delineate the columns or reference a database table with the appropriate structure. Every column from the XML referenced in the SELECT statement must appear in the WITH section (either explicitly delineated or in the referenced table). In this sample

-- populate the table variable with the tests -- to be performed
insert @tempValidateLookupData
select [Type], [Name], [Value], [Test], NULL
    from OPENXML (@idoc, '/ValidateData/
    Validate')
    with ([Type] nvarchar(30), [Name] nvarchar(30), [Value] nvarchar(300), [Test] nvarchar(30))

the column names in the SELECT statement and WITH clause must match the attribute names in the input XML. Note also that since the XML is treated as a recordset, the validation program will run successfully whether it's passed one datum or twelve—the calling process simply passes as many nodes of XML as are required to complete its validations.

Releasing Memory

The internal representation of the XML document will remain in memory until the process's server connection is terminated or reset, or until the memory is explicitly released. The sp_xml_removedocument system stored procedure should be called as early as is practical to release this memory in order to maximize available server resources (note that you must specify the memory handle to release):

--    free the server memory used by the 
--    OPENXML image of the input XML
exec sp_xml_removedocument @idoc

This step is especially important in environments in which connection pooling is used. Connections in such environments might stay live for days, and continued instantiation of XML documents in memory without corresponding releases could result in performance degradation as free memory decreases. Connection pooling is designed to reset a connection if the connection is reused by somebody else, which in turn should release XML handles. However, it is a best practice to explicitly release handles rather than depend upon dropped connections to do it for you.

Once the XML is parsed, sptxValidateLookupData runs a large IF branch to find the requested test and sets the results of the @TestResult memory variable as appropriate for the results of the test. The table variable is then updated with the results of the test, as you can see in Figure 4.

Figure 4 Find Requested Test

--    now use the table variable to drive the tests
--    by stepping through the records with NULL TestResult values
while exists (select TestResult from @tempValidateLookupData where TestResult is null) begin

--    pick a record to validate
    Select top 1 @Type=[Type], @Name=[Name], @Value=Value, @Test=Test
        from @tempValidateLookupData
        where TestResult is null

--    now run the appropriate test..

--    Countries (presented as a sample)
    if    @Type = 'Countries' begin
        if exists (select CountryCode from dbo.Country where CountryCode = convert(nvarchar(4), 
            @Value))
            select    @TestResult = 
            CASE    when @Test = 'Exists' then 1
                else 0
            end
        else
            select    @TestResult = 
            CASE    when @Test = 'Not Exists' then 1
                else 0
            end
    end

--    (sixteen other validations follow and are omitted for brevity)

--    now update the appropriate record with the test result
    update   @tempValidateLookupData
    set      TestResult = @TestResult
    where    Name = @Name
    and      Value = @Value
    and      Test = @Test
end

Syntax in SQL

Now that the validations are complete, you need to return the results to the calling process. SQL Server 2000 provides several mechanisms for building XML output via the SELECT statement's FOR XML directive. Before I continue my work on sptxValidateLookupData, I need to take a second, brief detour into new SQL syntax.

There are three types of FOR XML in SQL Server 2000. FOR XML RAW and FOR XML AUTO provide rudimentary XML output with minimal hassle—and a concomitant lack of control over the format of the output. Most serious SQL Server 2000 programmers use FOR XML EXPLICIT. In an EXPLICIT mode query, the programmer fully controls the shape of the XML document returned by the query and also assumes responsibility for assuring that the XML is well formed and valid.

What About EXPLICIT Content?

There are several exacting syntactic requirements for producing a proper EXPLICIT mode query. Every EXPLICIT mode query must contain two metadata columns. The first column specified in the SELECT statement must be named Tag, and be of datatype int. This is the tag number of the current element, essentially a subtree type number. The second column specified must be named Parent and must be of datatype int. This is the tag number of that element's parent element. These columns are used to determine the hierarchy of the XML tree. If a row has a Parent value of 0 or NULL, it is placed at the top of the XML hierarchy. With this single exception, all Parent values must correspond to a previously declared tag value. Note that a resultset should have exactly one row with a Parent value of 0 or NULL, and it must be the first row of the resultset (if more than one row with a parent value of 0 or NULL is present, an XML fragment will be generated).

The names of the remaining columns in the query must follow a specific format that associates the element names with the tag numbers and provides attribute names for the generated XML. The column name format that must be followed is

[ElementName!TagNumber!AttributeName!Directive]

where ElementName is the name of the element (if "Countries" is specified as ElementName, then the resulting element tag will be <Countries>).

TagNumber is the tag number of the element. Along with the Tag and Parent metadata columns, TagNumber is used to express the hierarchy of the XML tree. Every TagNumber corresponds to a single ElementName, and AttributeName is the name of the XML attribute, if applicable.

Working with Directive, along with several scenarios under which AttributeName can be NULL, is beyond the scope of this article. The SQL Server 2000 Books Online topic "Using Explicit Mode" is an excellent source for more information.

In addition, in order for the XML to be built correctly, the resultset must be ordered such that each parent is followed immediately by its children.

Returning now to sptxValidateLookupData, you'll recall from the specification statement that I have two resultset formats to return. If all of the tests in a particular batch evaluate to TRUE, I'll return an empty <Results/> tag. If, however, any of the tests fail, I'll return a list of errors in XML.

Figure 5 shows the code from sptxValidateLookupData that uses FOR XML EXPLICIT to generate the resultsets.

Figure 5 Using FOR XML EXPLICIT

--    if everything passed validation, return an empty result set in XML..
if not exists (select [TestResult] from @tempValidateLookupData where TestResult = 0)
    select 1 as TAG, 0 as parent,
    NULL as [Results!1!]
    for xml explicit

--    ..otherwise, return the failed test cases in XML.
else
    select 1 as TAG, 0 as parent,
    NULL as [Errors!1!],
    NULL as [Error!2!],
    NULL as [Error!2!ErrorMessage],
    NULL as [Error!2!FieldName]

    union all

    select 2 as TAG, 1 as parent,
    NULL,
    NULL,
    ltrim(rtrim(value)) + ' does not exist in ' + type,
    [name]

    from @tempValidateLookupData
    where [TestResult] = 0
    and test = 'Exists'

    union all

    select 2 as TAG, 1 as parent,
    NULL,
    NULL,
    ltrim(rtrim(value)) + ' already exists in ' + type,
    [name]

    from @tempValidateLookupData
    where [TestResult] = 0
    and test = 'Not Exists'

    for xml explicit

EXPLICIT mode syntax can get pretty verbose, so let's break it down. The first branch handles the simple case where all of my tests evaluate to TRUE:

-- if everything passed validation, return an -- empty result set in XML..
if not exists (select [TestResult] from @tempValidateLookupData where TestResult = 0)
    select 1 as TAG, 0 as parent,
    NULL as [Results!1!]
    for xml explicit

Since no AttributeName is specified, the following statement will create a single XML element called Results with no attributes and no children:

<Results/>

The ELSE branch here is far more interesting, as I'll construct the "error" XML for failed tests via a UNION of SELECT statements. Each individual query is responsible for rendering a particular node of the XML output (described in the new comments after each query), as shown in Figure 6.

Figure 6 Rendering Nodes

--    ..otherwise, return the failed test cases in XML.
else
    select 1 as TAG, 0 as parent,
    NULL as [Errors!1!],
    NULL as [Error!2!],
    NULL as [Error!2!ErrorMessage],
    NULL as [Error!2!FieldName]

--    (above query produces <Errors>)

    union all

    select 2 as TAG, 1 as parent,
    NULL,
    NULL,
    ltrim(rtrim(value)) + ' does not exist in ' + type,
    [name]

    from @tempValidateLookupData
    where [TestResult] = 0
    and test = 'Exists'

--    (above query produces nodes like
--    <Error ErrorMessage="BO1 does not exist in Countries" FieldName="CountryCode"/>)

    union all

    select 2 as TAG, 1 as parent,
    NULL,
    NULL,
    ltrim(rtrim(value)) + ' exists in ' + type,
    [name]

    from @tempValidateLookupData
    where [TestResult] = 0
    and test = 'Not Exists'

--    (above query produces nodes like
--    <Error ErrorMessage="JA exists in Countries" FieldName="CountryCode"/>)

    for xml explicit

Notice that the UNIONed queries are sequenced in a manner in which each child will immediately follow its parent, even though there can be children from multiple subqueries (the "exists" and "not exists" queries have the same "tag" and "parent" values). When the UNIONed results are passed through FOR XML EXPLICIT, the FOR XML serializer evaluates the [ElementName!TagNumber!AttributeName!Directive] column names along with the values in the Tag and Parent metadata columns to render the XML in the hierarchy specified by the programmer:

<Errors>
  <Error ErrorMessage="JA exists in Countries"
    FieldName="CountryCode"/>
  <Error ErrorMessage="BO1 does not exist in 
    Countries" FieldName="CountryCode"/>
  <Error ErrorMessage="EN does not exist in 
    ContactPreferences" 
    FieldName="ContactPreferenceCode"/>
</Errors>

Mission Accomplished

With the tools available to you, you're ready to query and generate XML in a SQL Server 2000 environment. As powerful as these tools are, however, the upcoming SQL Server 2005 promises increased programming power and even greater productivity improvements for anyone who wants to access data in XML using T-SQL. For a more detailed look at what XML handling is going to look like in SQL Server 2005, check out the sidebar that follows entitled "XML Programming in SQL Server 2005." You should be looking forward to the new syntax in the upcoming version.

XML Programming in SQL Server 2005

Microsoft has a slew of XML enhancements in the works for SQL Server 2005. For the past two years I've been developing a new system based on internal releases of this product (heaven for a SQL developer), and I'm proud to report that in the XML universe, the news is all good.

While all of the XML-oriented syntax introduced in SQL Server 2000 remains intact, there is also a huge suite of enhancements in SQL Server 2005. Here I'll provide an overview of these enhancements, and explain how they alter your approach to the sptxValidateLookupData stored procedure discussed in the accompanying article. The new SQL Server 2005 features I'll discuss are: the XML datatype; FOR XML PATH, TYPE; and the nodes() syntax.

While all of the XML techniques in SQL Server 2000 (such as passing XML strings into stored procedures via ntext parameters and using sp_xml_preparedocument, sp_xml_removedocument, FOR XML EXPLICIT, and OPENXML to manipulate it) are available in SQL Server 2005, replacements for each of them are introduced in the new product. And once you work with the new syntax, you'll never want to go back.

SQL Server 2005 includes some revolutionary changes to the storage engine itself, including three new datatypes: nvarchar(max), varbinary(max), and xml. Given the difficulties in handling ntext values in T-SQL, the new nvarchar(max) and varbinary(max) are an attractive alternative to using ntext if you want to continue using the same XML coding constructs that you did in SQL Server 2000. However, the provision of an XML datatype in the server engine opens up an entirely new range of programming possibilities.

Reviewing the Sample Case

You'll remember from the article that the sptxValidateLookupData stored procedure provides a single process for multiple, heterogeneous, variable batch validations of user inputs to Web pages. The directives for these validations are passed to the stored procedure in a single XML document. The stored procedure runs the validations and returns the results to the calling process, also in a single XML document.

The New Proc

As in the case of SQL Server 2000, I place the contents of the inbound XML document into a table variable. Rather than calling sp_xml_preparedocument and using OPENXML, however, I change the datatype of the parameter to xml (from ntext) and use the new nodes() syntax, as you see in Figure A.

Let's take a closer look at the nodes method. It provides a reference per row for every Validate element found under ValidateData elements in the @XMLString variable, bound as the node rowset using AS NODE(ref). Each column citation in the SELECT statement then uses the value method to extract the specified attribute values on a row-by-row basis, thus generating a row per Validate element containing the properties.

Each call to the value method contains the attribute name of the element to be extracted from the XML. This is delimited in single quotes and prepended by the @ sign, and is followed by the SQL datatype to which the attribute is to be mapped (which is also quote-delimited).

The logic that ran the validations in the original sptxValidateLookupData remains unchanged in the new sproc.

Returning the Results

You'll recall from the original specification statement that there are two resultset formats to return. If all of the tests in a particular batch evaluate to TRUE, an empty <Results/> tag will be returned. If, however, any of the tests fail, a list of errors in XML will be returned.

Figure B shows the new code from sptxValidateLookupData that uses FOR XML PATH, TYPE to generate the resultsets. Note how much more compact this syntax is than the previous FOR XML EXPLICIT efforts.

As before, the first branch handles the simple case where all of our tests evaluate to TRUE, as shown here:

--  if everything passed validation, return
--  an empty result set in XML...
if not exists (select [TestResult] from
  @tempValidateLookupData where TestResult = 0)
    select null
    for xml path ('Results'), type

The ELSE branch still constructs my "error" XML for failed tests, but this time it does so via a series of nested SELECT FOR XML PATH, TYPE statements. The explicit column names will direct SQL Server to use these names as XML attribute names (they must be single-quote delimited, and must be prepended with the @ sign). The FOR XML PATH ('Error') directives instruct the server to wrap the XML created in the inner selects in an Error element, while the FOR XML Path ('Errors') directive to the outer SELECT provides a root element called 'Errors'. Thus, this SQL will generate the same resultset as the earlier, far more verbose FOR XML EXPLICIT query.

While the immediate impact might not be apparent, the move away from the UNIONed SELECTs used in FOR XML EXPLICIT results in much more compact and maintainable code in heterogeneous queries. As an example, my team recently rewrote one 5000-line SQL Server 2000 user-defined function (which was basically a 43-level FOR XML EXPLICIT query) in 497 lines using this new syntax.

If you wanted to stick with the UNION model (which you typically won't), you could also write the second SELECT as shown in Figure C. This approach wouldn't make much sense if your internal XML nodes contained differing numbers of attributes. One of the major benefits of the new nested syntax is that the subqueries aren't required to have the same number of columns, as UNIONed SELECTs are. It would, however, serve to introduce the FOR XML ROOT construction. This allows a root element for the resulting XML to be declared without building an explicit SELECT for it, as I did in the first sample. For more on this, see XML Options in Microsoft SQL Server 2005.

With these new tools, you're ready to query and generate XML in a SQL Server 2005 environment even more efficiently than before. I've only scratched the surface of what's possible with this new syntax. There is very little you can't do with XML in SQL Server 2005. For all the latest SQL Server 2005 resources, you should visit Introducing SQL Server 2005.

Ward Pond has programmed relational database software since 1980, and has worked at Microsoft since 1999. When he's not tweaking the last millisecond of performance out of a SQL Server database, he can be found at home with his family, dogs, and guitars.