The Best of Both Worlds: Combining XPath with the XmlReader

 

Dare Obasanjo and Howard Hao
Microsoft Corporation

May 5, 2004

Download the XPathReader.exe sample file.

Summary: Dare Obasanjo discusses the XPathReader, which provides the ability to filter and process large XML documents in an efficient manner using an XPath-aware XmlReader. With the XPathReader, one can sequentially process a large document and extract an identified sub-tree matched by an XPath expression. (11 printed pages)

Introduction

About a year ago, I read an article by Tim Bray entitled XML Is Too Hard For Programmers, in which he complained about the cumbersome nature of push model APIs, like SAX, for dealing with large streams of XML. Tim Bray describes an ideal programming model for XML as one that is similar to working with text in Perl, where one can process streams of text by matching items of interest using regular expressions. Below is an excerpt from Tim Bray's article showing his idealized programming model for XML streams.

while (<STDIN>) {
  next if (X<meta>X);
  if    (X<h1>|<h2>|<h3>|<h4>X)
  { $divert = 'head'; }
  elsif (X<img src="/^(.*\.jpg)$/i>X)
  { &proc_jpeg($1); }
  # and so on...
}

Tim Bray isn't the only one who yearned for this XML processing model. For the past few years, various people I work with have been working towards creating a programming model for processing streams of XML documents in a manner analogous to processing text streams with regular expressions. This article describes the culmination of this work—the XPathReader.

Finding Loaned Books: XmlTextReader Solution

To give a clear indication of the productivity gains from the XPathReader compared to existing XML-processing techniques with the XmlReader, I have created an example program that performs basic XML processing tasks. The following sample document describes a number of books I own and whether they are currently loaned out to friends.

 <books>
  <book publisher="IDG books" on-loan="Sanjay">
    <title>XML Bible</title>
    <author>Elliotte Rusty Harold</author>
  </book>
  <book publisher="Addison-Wesley">
    <title>The Mythical Man Month</title>
    <author>Frederick Brooks</author>
  </book>
  <book publisher="WROX">
    <title>Professional XSLT 2nd Edition</title>
    <author>Michael Kay</author>
  </book>
  <book publisher="Prentice Hall" on-loan="Sander" >
   <title>Definitive XML Schema</title>
   <author>Priscilla Walmsley</author>
  </book>
  <book publisher="APress">
   <title>A Programmer's Introduction to C#</title>
   <author>Eric Gunnerson</author>
  </book>
</books>
   

The following code sample displays the names of the persons that I've loaned books to, as well as which books I have loaned to them. The code samples should produce the following output.

Sanjay was loaned XML Bible by Elliotte Rusty Harold 
Sander was loaned Definitive XML Schema by Priscilla Walmsley

XmlTextReader Sample: 
using System; 
using System.IO; 
using System.Xml;

public class Test{


    static void Main(string[] args) {

      try{ 
      XmlTextReader reader = new XmlTextReader("books.xml");
      ProcessBooks(reader);

      }catch(XmlException xe){
        Console.WriteLine("XML Parsing Error: " + xe);
      }catch(IOException ioe){
        Console.WriteLine("File I/O Error: " + ioe);
      }
    }  

    static void ProcessBooks(XmlTextReader reader) {
      
      while(reader.Read()){
      
        //keep reading until we see a book element 
        if(reader.Name.Equals("book") && 
      (reader.NodeType == XmlNodeType.Element)){ 
          
     if(reader.GetAttribute("on-loan") != null){ 
            ProcessBorrowedBook(reader);
          }else {
            reader.Skip();
          }
        }
      }
    }


   static void ProcessBorrowedBook(XmlTextReader reader){

 Console.Write("{0} was loaned ", 
                             reader.GetAttribute("on-loan"));
      
      
      while(reader.NodeType != XmlNodeType.EndElement && 
                                            reader.Read()){
       
       if (reader.NodeType == XmlNodeType.Element) {
          
     switch (reader.Name) {
            case "title":              
              Console.Write(reader.ReadString());
              reader.Read(); // consume end tag
              break;
            case "author":
              Console.Write(" by ");
              Console.Write(reader.ReadString());
              reader.Read(); // consume end tag
              break;
          }
        }
      }
      Console.WriteLine();
    }
}       

Using XPath as Regular Expressions for XML

The first thing we need is a way to perform pattern matching for nodes of interest in an XML stream in the same way we can with regular expressions for strings in a text stream. XML already has a language for matching nodes called XPath, which can serve as a good starting point. There is an issue with XPath that prevents it from being used without modification as the mechanism for matching nodes in large XML documents in a streaming manner. XPath assumes the entire XML document is stored in memory and allows operations that would require multiple passes over the document, or at least would require large portions of the XML document be stored in memory. The following XPath expression is an example of such a query:

/books/book[author='Frederick Brooks']/@publisher

The query returns the publisher attribute of a book element if it has a child author element whose value is 'Frederick Brooks'. This query cannot be executed without caching more data than is typical for a streaming parser because the publisher attribute has to be cached when seen on the book element until the child author element has been seen and its value examined. Depending on the size of the document and the query, the amount of data that has to be cached in memory could be quite large and figuring out what to cache could be quite complex. To avoid having to deal with these problems a co-worker, Arpan Desai, came up with a proposal for a subset of XPath that is suitable for forward-only processing of XML. This subset of XPath is described in his paper An Introduction to Sequential XPath.

There are several changes to the standard XPath grammar in Sequential XPath, but the biggest change is the restriction in the usage of axes. Now, certain axes are valid in the predicate, while other axes are valid only in the non-predicate portion of the Sequential XPath expression. We have classified the axes into three different groups:

  • Common Axes: provide information about the context of the current node. They can be applied anywhere in the Sequential XPath expression.
  • Forward Axes: provide information about nodes ahead of the context node in the stream. They can only be applied in the location path context because they are looking for 'future' nodes. An example is "child.'' We can successfully select the child nodes of a given path if "child" is in the path. However, if "child" were in the predicate, we would not be able to select the current node because we cannot look ahead to its children to test the predicate expression and then rewind the reader to select the node.
  • Reverse Axis: are essentially the opposite of Forward Axes. An example would be "parent." If parent were in the location path, we would want to return the parent of a specific node. Once again, because we cannot go backward, we cannot support these axes in the location path or in predicates.

Here is a table showing the XPath axes supported by the XPathReader:

Type Axes Where Supported
Common Axes attribute, namespace, self Anywhere in XPath expression
Forward Axes child, descendant, descendant-or-self, following, following-sibling Anywhere in XPath expression except predicates
Reverse Axis ancestor, ancestor-or-self, parent, preceding, preceding-sibling Not supported

There are some XPath functions not supported by the XPathReader due to the fact that they also require caching large parts of the XML document in memory or the ability to backtrack the XML parser. Functions such as count() and sum() are not supported at all, while functions such as local-name() and namespace-uri() only work when no arguments are specified (that is, only when asking for these properties on the context node). The following table lists the XPath functions that are either unsupported or have had some of their functionality limited in the XPathReader.

XPath Function Supported Subset Description
number last() Not Supported Cannot work without buffering
number count(node-set) Not Supported Cannot work without buffering
string local-name(node-set?) string local-name() Cannot use a node-set as a parameter
string namespace-uri(node-set?) string namespace-uri() Cannot use a node-set as a parameter
string name(node-set?) string name() Cannot use a node-set as a parameter
number sum(node-set) Not Supported Cannot work without buffering

The final major restriction made to XPath in the XPathReader is to disallow testing for the values of elements or text nodes. The XPathReader does not support the following XPath expression:

 /books/book[contains(.,'Frederick Brooks')]

The above query selects the book element if its string contains the text 'Frederick Brooks'. To be able to support such queries, large parts of the document may have to be cached and the XPathReader would need to be able to rewind its state. However, testing values of attributes, comments, or processing instructions is supported. The following XPath expression is supported by the XPathReader:

/books/book[contains(@publisher,'WROX')]

The subset of XPath described above is sufficiently reduced as to enable one to provide a memory-efficient, streaming XPath-based XML parser that is analogous to regular expressions matching for streams of text.

A First Look at the XPathReader

The XPathReader is a subclass of the XmlReader that supports the subset of XPath described in the previous section. The XPathReader can be used to process files loaded from a URL or can be layered on other instances of XmlReader. The following table shows the methods added to the XmlReader by the XPathReader.

Method Description
Match(XPathExpression) Tests whether the node in which the reader is currently positioned is matched by the XPathExpression.
Match(string) Tests whether the node in which the reader is currently positioned is matched by the XPath string.
Match(int) Tests whether the node in which the reader is currently positioned is matched by the XPath expression at the specified index in the reader's XPathCollection.
MatchesAny(ArrayList) Tests whether the node in which the reader is currently positioned on is matched by any of the XPathExpressions in the list.
ReadUntilMatch() Continues reading the XML stream until the current node matches one of the specified XPath expressions.

The following example uses the XPathReader to print the title of every book in my library:

using System; 
using System.Xml;
using System.Xml.XPath;
using GotDotNet.XPath;

public class Test{
static void Main(string[] args) {

      try{ 
XPathReader xpr  = new XPathReader("books.xml", "//book/title"); 

            while (xpr.ReadUntilMatch()) {
               Console.WriteLine(xpr.ReadString()); 
             }      
            Console.ReadLine(); 
   }catch(XPathReaderException xpre){
      Console.WriteLine("XPath Error: " + xpre);
      }catch(XmlException xe){
         Console.WriteLine("XML Parsing Error: " + xe);
      }catch(IOException ioe){
         Console.WriteLine("File I/O Error: " + ioe);
      }
   }  
}

An obvious advantage of the XPathReader over conventional XML processing with the XmlTextReader is that the application does not have to keep track of the current node context while processing the XML stream. In the example above, the application code doesn't have to worry about whether the title element whose contents it is displaying and printing is a child of a book element or not by explicitly tracking state because this is already done by the XPath.

The other piece of the puzzle is the XPathCollection class. The XPathCollection is the collection of XPath expressions that the XPathReader is supposed to match against. An XPathReader only matches nodes contained in its XPathCollection object. This matching is dynamic, meaning that XPath expressions can be added and removed from the XPathCollection during the parsing process as needed. This allows for performance optimizations where tests aren't made against XPath expressions until they are needed. The XPathCollection is also used for specifying prefix<->namespace bindings used by the XPathReader when matching nodes against XPath expressions. The following code fragment shows how this is accomplished:

XPathCollection xc  = new XPathCollection();
xc.NamespaceManager = new XmlNamespaceManager(new NameTable()); 
xc.NamespaceManager.AddNamespace("ex", "http://www.example.com"); 
xc.Add("//ex:book/ex:title"); 

XPathReader xpr  = new XPathReader("books.xml", xc); 

Finding Loaned Books: XPathReader Solution

Now that we've taken a look at the XPathReader, it's time to see how much improved processing XML files can be compared to using the XmlTextReader. The following code sample uses the XML file in the section entitled Finding Loaned Books: XmlTextReader Solution and should produce the following output:

Sanjay was loaned XML Bible by Elliotte Rusty Harold 
Sander was loaned Definitive XML Schema by Priscilla Walmsley

XPathReader Sample: 
using System; 
using System.IO; 
using System.Xml;
using System.Xml.XPath;
using GotDotNet.XPath;

public class Test{
static void Main(string[] args) {

      try{ 
         XmlTextReader xtr = new XmlTextReader("books.xml"); 
         
         XPathCollection xc = new XPathCollection();
         int onloanQuery = xc.Add("/books/book[@on-loan]");
         int titleQuery  = xc.Add("/books/book[@on-loan]/title");
         int authorQuery = xc.Add("/books/book[@on-loan]/author");

         XPathReader xpr  = new XPathReader(xtr, xc); 

         while (xpr.ReadUntilMatch()) {

            if(xpr.Match(onloanQuery)){
               Console.Write("{0} was loaned ", xpr.GetAttribute("on-loan"));
            }else if(xpr.Match(titleQuery)){
               Console.Write(xpr.ReadString());
            }else if(xpr.Match(authorQuery)){
               Console.WriteLine(" by {0}", xpr.ReadString());
            }

         }         

         Console.ReadLine(); 

   }catch(XPathReaderException xpre){
      Console.WriteLine("XPath Error: " + xpre);   
   }catch(XmlException xe){
         Console.WriteLine("XML Parsing Error: " + xe);
      }catch(IOException ioe){
         Console.WriteLine("File I/O Error: " + ioe);
      }
   }  
}

This output is greatly simplified from the original code block, is almost as efficient memory-wise, and very analogous to processing text streams with regular expressions. It looks like we have reached Tim Bray's ideal for an XML programming model for processing large XML streams.

How the XPathReader Works

The XPathReader matches XML nodes by creating a collection of XPath expressions compiled into an abstract syntax tree (AST) and then walking this syntax tree while receiving incoming nodes from the underlying XmlReader. By walking through the AST tree, a query tree is generated and pushed onto a stack. The depth of the nodes to be matched by the query is calculated and compared against the Depth property of the XmlReader as nodes are encountered in the XML stream. The code for generating the AST for an XPath expression is obtained from the underlying code for the classes in the System.Xml.Xpath, which is available as part of the source code in the Shared Source Common Language Infrastructure 1.0 Release.

Each node in the AST implements the IQuery interface that defines the following three methods:

        internal virtual object GetValue(XPathReader reader);
        internal virtual bool MatchNode(XPathReader reader);
        internal abstract XPathResultType ReturnType()

The GetValue method returns the value of the input node relative to the current aspect of the query expression. The MatchNode method tests whether the input node matches the parsed query context, while the ReturnType property specifies which XPath type the query expression evaluates.

Future Plans for XPathReader

Based on how useful various people at Microsoft have found the XPathReader, including the BizTalk Server that ships with a variation of this implementation, I've decided to create a GotDotNet workspace for the project. There are a few features I'd like to see added, such as integration of some of the functions from the EXSLT.NET project into the XPathReader and support for a wider range of XPath. Developers that would like to work on further development of XPathReader can join the GotDotNet workspace.

Conclusion

The XPathReader provides a potent way for processing XML streams by harnessing the power of XPath and combining it with the flexibility of the pull-based XML parser model of the XmlReader. The compositional design of System.Xml allows one to layer the XPathReader over other implementations of the XmlReader and vice versa. Using the XPathReader for processing XML streams is almost as fast as using the XmlTextReader, but at the same time is as usable as XPath with the XmlDocument. Truly it is the best of both worlds.

Dare Obasanjo is a member of Microsoft's WebData team, which among other things develops the components within the System.Xml and System.Data namespace of the .NET Framework, Microsoft XML Core Services (MSXML), and Microsoft Data Access Components (MDAC).

Howard Hao is a Software Design Engineer in Test on the WebData XML team and is the main developer of the XPathReader.

Feel free to post any questions or comments about this article on the Extreme XML message board on GotDotNet.