Creating XML Web Services That Parse the Contents of a Web Page

The Web today exposes an immense quantity of information. Unfortunately, most of this data is easy to interpret only for a person reading it in a browser. XML Web services created using ASP.NET help improve this situation by providing an HTML parsing solution that enables developers to parse content from a remote HTML page and expose the resulting data programmatically. Once permission is obtained from the publisher of the Web site content, and provided that the layout of the content does not change, HTML parsing can be used to expose XML Web services that clients can consume.

Building an XML Web service that parses the contents of a Web page follows a different paradigm from building a typical XML Web service. Such a service is implemented by creating a service description, which is an XML document written in the Web Services Description Language (WSDL). Within the service description, XML elements specify both the input parameters and the data to return from the parsed HTML page. Most of the implementation lies in specifying the returned data, because that is where the instructions for parsing the HTML content go. To add these XML elements, a developer must understand the layout of a WSDL document. For details on WSDL, see the WSDL specification at the W3C Web site (www.w3.org/TR/wsdl).

Specifying Input Parameters

Input parameters can be passed to the Web server if the HTML page being parsed accepts parameters that affect the contents of the returned HTML page.

To specify input parameters

  • Add <part> child XML elements to the <message> XML element that the <input> element of the operation references for a given <portType>.

    Each <part> child element represents a parameter and has two attributes: name and type. The name attribute is the parameter name and the type attribute is the data type of the parameter. Complex types can be defined within an XSD schema in the types section of the service description, and then specified as the data type for a parameter.

    The following code example defines three input parameters named param1, param2, and param3 within the GetTitlesHttpGetIn <message> for the TitlesHttpGet <portType>.

    <definitions xmlns:s="http://www.w3.org/2001/XMLSchema"
                 xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
                 xmlns:s0="http://tempuri.org/"
                 xmlns:myarray="http://tempuri.org/MyArrayType"
                 targetNamespace="http://tempuri.org/"
                 xmlns="http://schemas.xmlsoap.org/wsdl/">
      <types>
        <s:schema targetNamespace="http://tempuri.org/MyArrayType">
          <s:complexType name="StringArray">
            <s:complexContent>
              <s:restriction xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
                 base="soapenc:Array">
                <s:sequence>
                  <s:element name="String" type="s:string" minOccurs="0" 
                     maxOccurs="unbounded" />
                </s:sequence>
              </s:restriction>
            </s:complexContent>
          </s:complexType>
        </s:schema>
      </types>
      <message name="GetTitlesHttpGetIn">
        <part name="param1" type="s:string"/>
        <part name="param2" type="s:string"/>
        <part name="param3" type="myarray:StringArray"/>
      </message>
      <portType name="TitlesHttpGet">
        <operation name="GetTitles">
          <input message="s0:GetTitlesHttpGetIn"/>
          <output message="s0:GetTitlesHttpGetOut"/>
        </operation>
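      </portType>
      <!-- The output message, binding, and service elements are omitted from this fragment. -->
    </definitions>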
    

Specifying the Data to Return from the Parsed HTML Page

The data to return from a parsed HTML page is expressed within the service description as a series of <match> XML elements, each of which names a piece of data and contains a regular expression that extracts it. At the heart of each <match> element is a .NET Framework regular expression, an extensive pattern-matching notation that allows you to quickly parse large amounts of text to find specific character patterns. For details regarding the .NET Framework regular expression syntax, see .NET Framework Regular Expressions.

To specify the data returned from a parsed HTML page

  1. Add a namespace-qualified <text> XML element within the <output> element of the <operation> element for the desired <binding>.

  2. Add <match> XML elements in the service description within the <text> XML element for each piece of data you want to return from the parsed HTML page.

    Attribute    Description
    ---------    -----------
    name         The class or property name representing the returned piece of data. A proxy class generated by the Wsdl.exe tool associates the name attribute with a class, if the <match> XML element has child <match> elements. The child <match> elements are mapped to properties of the class.
    pattern      The regular expression pattern to use in order to obtain the piece of data. For details regarding the .NET Framework regular expression syntax, see .NET Framework Regular Expressions.
    ignoreCase   Specifies whether the regular expression should be run case-insensitive. The default is case-sensitive.
    repeats      Specifies the number of values that should be returned from the regular expression, in case the regular expression has multiple matches on the HTML page. A value of 1 returns only the first match. A value of -1 returns all matches. A value of -1 equates to a * in a regular expression. The default value is -1.
    group        Specifies a grouping of related matches.
    capture      Specifies the index of a match within a grouping.
    type         Proxy classes generated using Wsdl.exe will use the type attribute as the name of the returned class for a <match> that contains child <match> elements. By default, a proxy class generated by Wsdl.exe will set the name of the returned class to the name specified in the name attribute.
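
The attributes in the table above resemble concepts in System.Text.RegularExpressions. The following C# sketch is not the ASP.NET text-matching implementation. It only illustrates how a pattern could be applied, assuming that ignoreCase corresponds to RegexOptions.IgnoreCase, that repeats limits the number of matches returned, and that group and capture index into Match.Groups and Group.Captures. The class MatchAttributeSketch, the method Apply, and the sample HTML string are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchAttributeSketch
{
    // Hypothetical helper: applies a pattern in the way the <match> attributes
    // are described in the table. This is an assumption about their meaning,
    // not the code that ASP.NET actually runs.
    public static List<string> Apply(string html, string pattern,
        bool ignoreCase, int repeats, int group, int capture)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        List<string> results = new List<string>();

        foreach (Match m in Regex.Matches(html, pattern, options))
        {
            // A repeats value of -1 means "all matches"; otherwise stop after that many.
            if (repeats >= 0 && results.Count >= repeats)
            {
                break;
            }

            // Select the requested group, then the requested capture within it.
            Group g = m.Groups[group];
            if (g.Success && capture < g.Captures.Count)
            {
                results.Add(g.Captures[capture].Value);
            }
        }
        return results;
    }

    public static void Main()
    {
        string html = "<ul><li>First</li><LI>Second</LI><li>Third</li></ul>";

        // ignoreCase = true, repeats = -1: every list item, regardless of tag case.
        List<string> all = Apply(html, "<li>(.*?)</li>", true, -1, 1, 0);
        Console.WriteLine(string.Join(", ", all.ToArray()));   // First, Second, Third

        // repeats = 1: only the first match is returned.
        List<string> first = Apply(html, "<li>(.*?)</li>", true, 1, 1, 0);
        Console.WriteLine(string.Join(", ", first.ToArray())); // First
    }
}
```

With ignoreCase set to false, the same call would skip the upper-case <LI> item, which is why case-insensitive matching is often useful for hand-written HTML.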

    The following code example is a simple Web page sample containing <TITLE> and <H1> tags.

    <HTML>
     <HEAD>
      <TITLE>Sample Title</TITLE>
     </HEAD>
     <BODY>
        <H1>Some Heading Text</H1>
     </BODY>
    </HTML>
    

    The following code example is a service description that parses the contents of the HTML page, extracting the contents of the text within the <TITLE> and <H1> tags. In the code example, a TestHeaders method is defined for the GetTitleHttpGet binding. The TestHeaders method defines two pieces of data that can be returned from the parsed HTML page in <match> XML elements: Title and H1, which parse the contents of the <TITLE> and <H1> tags, respectively.

    <?xml version="1.0"?>
    <definitions xmlns:s="http://www.w3.org/2001/XMLSchema"
                 xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
                 xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
                 xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
                 xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
                 xmlns:s0="http://tempuri.org/"
                 targetNamespace="http://tempuri.org/"
                 xmlns="http://schemas.xmlsoap.org/wsdl/">
      <types>
        <s:schema targetNamespace="http://tempuri.org/"
                  attributeFormDefault="qualified"
                  elementFormDefault="qualified">
          <s:element name="TestHeaders">
            <s:complexType derivedBy="restriction"/>
          </s:element>
          <s:element name="TestHeadersResult">
            <s:complexType derivedBy="restriction">
              <s:all>
                <s:element name="result" type="s:string" nullable="true"/>
              </s:all>
            </s:complexType>
          </s:element>
          <s:element name="string" type="s:string" nullable="true"/>
        </s:schema>
      </types>
      <message name="TestHeadersHttpGetIn"/>
      <message name="TestHeadersHttpGetOut">
        <part name="Body" element="s0:string"/>
      </message>
      <portType name="GetTitleHttpGet">
        <operation name="TestHeaders">
          <input message="s0:TestHeadersHttpGetIn"/>
          <output message="s0:TestHeadersHttpGetOut"/>
        </operation>
      </portType>
      <binding name="GetTitleHttpGet" type="s0:GetTitleHttpGet">
        <http:binding verb="GET"/>
        <operation name="TestHeaders">
          <http:operation location="MatchServer.html"/>
          <input>
            <http:urlEncoded/>
          </input>
          <output>
            <text xmlns="http://microsoft.com/wsdl/mime/textMatching/">
              <match name='Title' pattern='TITLE&gt;(.*?)&lt;'/>
              <match name='H1' pattern='H1&gt;(.*?)&lt;'/>
            </text>
          </output>
        </operation>
      </binding>
      <service name="GetTitle">
        <port name="GetTitleHttpGet" binding="s0:GetTitleHttpGet">
          <http:address location="https://localhost" />
        </port>
      </service>
    </definitions>
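
To see what the two patterns in the service description extract, the following C# sketch applies them directly to the sample page with System.Text.RegularExpressions. It assumes that the value returned for each <match> is the text of the first capture group of the first match. The class TitleMatchSketch and the method FirstGroup are hypothetical names, and the sketch does not call the Web service infrastructure at all.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleMatchSketch
{
    // The sample Web page shown earlier in this topic.
    public const string SamplePage =
        "<HTML>\n <HEAD>\n  <TITLE>Sample Title</TITLE>\n </HEAD>\n" +
        " <BODY>\n    <H1>Some Heading Text</H1>\n </BODY>\n</HTML>";

    // Returns the text captured by the first group of the first match,
    // or null when the pattern does not match.
    public static string FirstGroup(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // In the WSDL the patterns are XML-escaped (&gt; and &lt;);
        // here they appear unescaped, as the regular expression engine sees them.
        Console.WriteLine(FirstGroup(SamplePage, "TITLE>(.*?)<")); // Sample Title
        Console.WriteLine(FirstGroup(SamplePage, "H1>(.*?)<"));    // Some Heading Text
    }
}
```

The lazy quantifier in (.*?) stops at the first < that follows, so each pattern captures only the text between the opening and closing tag.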
    

    The following code example is a portion of the proxy class generated by Wsdl.exe for the previous service description.

    [Visual Basic]
    Imports System.Web.Services.Protocols

    ' GetTitle is the name of the proxy class.
    Public Class GetTitle
      Inherits HttpGetClientProtocol
      Public Function TestHeaders() As TestHeadersMatches
         Return CType(Me.Invoke("TestHeaders", (Me.Url + _
              "/MatchServer.html"), New Object(-1) {}),TestHeadersMatches)
      End Function
    End Class
    Public Class TestHeadersMatches
      Public Title As String
      Public H1 As String
    End Class
    [C#]
    using System.Web.Services.Protocols;

    // GetTitle is the name of the proxy class.
    public class GetTitle : HttpGetClientProtocol
    {
      public TestHeadersMatches TestHeaders() 
      {
            return ((TestHeadersMatches)(this.Invoke("TestHeaders", 
                     (this.Url + "/MatchServer.html"), new object[0])));
      }
    }    
    public class TestHeadersMatches 
    {
        public string Title;
        public string H1;
    }
    

See Also

.NET Framework Regular Expressions | MatchAttribute Class | Building XML Web Services Using ASP.NET