Building a Desktop News Aggregator

 

Dare Obasanjo
Microsoft Corporation

Revised March 14, 2003

Summary: Dare Obasanjo builds a C# application that retrieves and displays news feeds from various Web sites. The application utilizes XPath, XSLT, XML Schema, the DOM, and XML Serialization in the .NET Framework. (12 printed pages)

Download the xml02172003_sample.exe.

Note The sample application associated with this article was updated on March 14, 2003. Significant upgrades to various features have been made and it is recommended that you upgrade earlier versions to this latest release.

Introduction

Like most people who spend time online, I have a number of Web sites I read on a daily basis. I recently noticed that I was checking an average of five to ten Web sites every other hour when I wanted to see if there were any new articles or updates to the content on a site. This prompted me to investigate the likelihood of creating a desktop application that would do all the legwork for me and alert me when new content appeared on my favorite Web sites. My investigations led to my discovery of RSS and the creation of my desktop news aggregator, RSS Bandit.

What Is RSS?

RSS is an XML format used for syndicating news and similar content from online news sources. RSS is used by news sites like C|Net and Wired, online technical journals like XML.com, and Web logs like Don Box's Spoutlet and Joel on Software.

An RSS feed is a regularly updated XML document that contains metadata about a news source and the content in it. At a minimum, an RSS feed consists of a channel element that represents the news source and has a title, link, and description that describe it. Additionally, an RSS feed typically contains one or more item elements that represent individual news items, each of which should have a title, link, or description.

Note The aforementioned elements appear in most RSS feeds but are not the only ones that can appear. Many RSS feeds also contain additional elements such as date or language. However, these elements and many others appear less commonly in practice.

Below is a sample RSS 0.91 feed for the MSDN XML Developer Center:

<rss version="0.91">
  <channel>
    <title>MSDN XML Developer Center</title>
    <link>https://msdn.microsoft.com/xml/</link>
    <description> Extensible Markup Language (XML) is the universal format
    for data on the Web. XML allows developers to easily describe and 
    deliver rich, structured data from any application in a standard, 
    consistent way. XML does not replace HTML; rather, it is a 
    complementary format. </description>
    <item>
      <title> XML Files: XPath Selections and Custom Functions, and More 
      </title>
      <link> 
      https://msdn.microsoft.com/msdnmag/issues/03/02/xmlfiles/TOC.asp
      </link>
      <description> Get your questions about XPath selections, custom 
      functions, and more answered in this month's column.  (February 5, 
      Article)</description>
    </item>
    <item>
      <title> Extreme XML: XML Serialization in the .NET Framework </title>
      <link> https://msdn.microsoft.com/library/en-us/dnexxml/html/xml01202003.asp
      </link>
      <description> Dare Obasanjo discusses XML serialization and how you 
      can use it within the .NET Framework to improve interoperability and 
      meet W3C standards.  (February 3, Article) </description>
    </item>
  </channel>
</rss>

For more information about RSS, read Mark Pilgrim's informative article entitled What is RSS? on XML.com.

Functional Requirements for the News Aggregator

RSS news aggregators are desktop or Web applications that are used to retrieve and display RSS feeds from various news sources. Examples of RSS news aggregators include NewzCrawler, NewsGator, and AmphetaDesk. I tried a few RSS news aggregators but didn't find one with the right mix of features and functionality for my tastes. Therefore, I decided to write one myself that met my needs.

I had the following functional requirements for my news aggregator:

  1. The news aggregator must be able to process the three most popular versions of RSS (versions 0.91, 1.0 and 2.0).
  2. The news aggregator must use a three-paned user interface similar to Microsoft Outlook® Express for displaying RSS feeds.
  3. The news aggregator must use an embedded Web browser to allow viewing rich content and navigation to Web pages linked to from RSS items.
  4. The news aggregator must allow importation and exportation of a list of subscribed feeds using OPML, the standard format used by other aggregators.
  5. The news aggregator must provide the option to control how often each individual feed is checked.
  6. The news aggregator should provide keyboard shortcuts for common tasks like navigating through new items.
  7. The news aggregator must be able to track what messages have already been read between invocations of the application.
  8. The news aggregator should be able to show you the raw XML from a particular RSS feed.
  9. The news aggregator should cache RSS feeds on disk between invocations of the program.
  10. The news aggregator should provide the ability to mark read items as unread.
  11. The news aggregator should support ISA clients and/or Web proxies.
  12. The news aggregator must use HTTP conditional GET requests to reduce bandwidth costs on news sources.
  13. The news aggregator must be able to run on a system that meets the following minimum requirements:
    • Microsoft Windows® 2000, Windows XP, or above
    • Microsoft .NET Framework 1.0
    • LAN/Dialup Internet Connection

In implementing my RSS news aggregator, called RSS Bandit, I satisfied all of the aforementioned functional requirements except for number 9, caching feeds on disk. I left it out for two reasons. First, initial tests showed no significant performance difference between loading feeds from disk and refreshing them from the Web on a broadband connection, while caching added some complexity to the code. Second, given that the expected usage pattern for RSS Bandit is as an always-on application, the feature did not seem absolutely necessary.

A Look at the RSS Bandit User Interface

The user interface for RSS Bandit is inspired by mail and newsreaders such as Microsoft Outlook and Microsoft Outlook Express. Figure 1 is a screenshot of RSS Bandit showing the embedded Web browser in action.

Figure 1. Reading News with RSS Bandit

Figure 2 is a screenshot that shows the popup message that indicates that new items have been retrieved.

Figure 2. Receiving New Messages with RSS Bandit

Overview of the RSS Bandit Architecture

The RSS Bandit application is primarily driven by two classes. The RssHandler class manages downloading of RSS feeds, while the RssBanditView class provides a graphical front end for viewing RSS feeds and interacting with the RssHandler class.

The RssHandler class downloads RSS feeds at specified intervals and stores them. The class is not tightly coupled to the user interface and can be reused by other applications that need to process RSS feeds. Clients that use the RssHandler class register a callback (delegate) when they instantiate it. The RssHandler object then invokes the registered callback when new or updated feeds are downloaded. The information about which feeds to download, along with other configuration data, is obtained from a feed subscription list written in XML. Because the amount of time between downloads of a particular RSS feed is user configurable, the RssHandler class has a timer that goes off every five minutes and checks each feed to see whether enough time has elapsed since the last download attempt for that feed. This means that a feed cannot have more than one download attempt made against it in a five-minute span.
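The polling rule described above can be sketched roughly as follows. This is an illustration only, not the RssHandler source: the names FeedSchedule, NewFeedsCallback, Subscribe, IsDue, and CheckFeeds are hypothetical, the refresh rate is assumed to be in milliseconds (as the schema annotation later in this article states), and the actual download is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical callback type that a client registers to hear about refreshed feeds.
public delegate void NewFeedsCallback(string feedUrl);

public class FeedSchedule {
    // Assumed per-feed state: last fetch time and refresh interval in milliseconds.
    private readonly Dictionary<string, DateTime> lastRetrieved = new Dictionary<string, DateTime>();
    private readonly Dictionary<string, int> refreshRateMs = new Dictionary<string, int>();
    private readonly NewFeedsCallback callback;

    public FeedSchedule(NewFeedsCallback callback) {
        this.callback = callback;
    }

    public void Subscribe(string feedUrl, int refreshRateInMs) {
        refreshRateMs[feedUrl] = refreshRateInMs;
        lastRetrieved[feedUrl] = DateTime.MinValue;   // never fetched yet
    }

    // True when at least the feed's refresh interval has passed since the last fetch.
    public static bool IsDue(DateTime last, int refreshRateInMs, DateTime now) {
        return now - last >= TimeSpan.FromMilliseconds(refreshRateInMs);
    }

    // Meant to be called from a timer tick (every five minutes in the article).
    // Returns the number of feeds that were due on this tick.
    public int CheckFeeds(DateTime now) {
        int refreshed = 0;
        // Iterate over a copy of the keys because lastRetrieved is updated in the loop.
        foreach (string url in new List<string>(refreshRateMs.Keys)) {
            if (IsDue(lastRetrieved[url], refreshRateMs[url], now)) {
                lastRetrieved[url] = now;   // a real handler would download the feed here
                callback(url);
                refreshed++;
            }
        }
        return refreshed;
    }
}
```

Because the check only runs on timer ticks, a feed whose refresh rate is shorter than the tick interval is still fetched at most once per tick, which matches the five-minute floor described above.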

The RssBanditView is a Windows Form that contains a tree view for displaying the list of subscribed feeds, a list view for displaying information about items from the currently selected feed, and an embedded Web browser for displaying content. On startup, the RssBanditView registers a delegate with the RssHandler that handles downloading and processing RSS feeds. Whenever new or updated feeds are downloaded, the RssBanditView is updated through the delegate in a thread-safe manner using techniques described in the Safe, Simple Multithreading in Windows Forms, Part 1 article by Chris Sells.

The user interface also enables the user to manage various aspects of the behavior of the RssHandler class. The user can add and remove feeds from the subscription list, configure how often feeds should be downloaded, and set proxy server information.

XML Technologies and RSS Bandit

The RSS Bandit application makes significant use of the XML technologies in the .NET Framework. RSS Bandit uses XML Serialization to convert the feed subscription list to objects and vice versa, XSLT to convert OPML files to the feed subscription list format, XSD validation to ensure the feed subscription list is valid, and XPath to process RSS feeds.

W3C XML Schema in RSS Bandit

The first step in working on RSS Bandit was deciding what information was necessary for the application to function on startup. After some brainstorming, I came up with two broad classes of information: feed subscriptions and configuration data. First, the application would need to be able to determine what feeds I wanted to read, how often it should fetch new items, and what news items were already read. Second, the application would need to know information about the proxy server through which to direct Web requests.

After deciding what information the application needed on startup, I needed to decide between storing this information in a configuration file and storing it in the Windows registry. I decided to go with an XML configuration file over storing the data in the Windows registry for several reasons. Not only is an XML configuration file more portable than registry settings, but it also allows me to process my settings using the wide range of technologies for processing XML information.

Let's first take a look at the schema for the configuration file for RSS Bandit:

<xs:schema targetNamespace='http://www.25hoursaday.com/2003/RSSBandit/feeds/' 
xmlns:xs='http://www.w3.org/2001/XMLSchema' elementFormDefault='qualified' 
xmlns:f='http://www.25hoursaday.com/2003/RSSBandit/feeds/'>
  <xs:element name='feeds'>
    <xs:complexType>
      <xs:sequence>

        <xs:element name='feed' minOccurs='0' maxOccurs='unbounded'>
          <xs:complexType>
            <xs:sequence>
              <xs:element name='title' type='xs:string' />
              <xs:element name='link' type='xs:anyURI' />
              <xs:element name='refresh-rate' type='xs:int' minOccurs='0'>
                <xs:annotation>
                  <xs:documentation>
       This describes how often the feed must be refreshed in 
       milliseconds. 
      </xs:documentation>
                </xs:annotation>
              </xs:element>
              <xs:element name='last-retrieved' type='xs:dateTime' 
              minOccurs='0' />
              <xs:element name='etag' type='xs:string' minOccurs='0' />
              <xs:element name='stories-recently-viewed' minOccurs='0'>
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name='story' type='xs:string' 
                    minOccurs='0' maxOccurs='unbounded' />
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute name='category' type='xs:string' use='optional' />
          </xs:complexType>
        </xs:element>

        <xs:element name='categories' minOccurs='0'>
          <xs:complexType>
            <xs:sequence>
              <xs:element name='category' type='xs:string' maxOccurs='unbounded' />
            </xs:sequence>
          </xs:complexType>
        </xs:element>

      </xs:sequence>
      <xs:attribute name='refresh-rate' type='xs:int' use='optional' />
      <xs:attribute name='proxy-server' type='xs:string' use='optional' />
      <xs:attribute name='proxy-port' type='xs:positiveInteger' 
      use='optional' />
    </xs:complexType>      

    <xs:key name='categories-key'>
     <xs:selector xpath='f:categories/f:category'/>
     <xs:field xpath='.'/>
    </xs:key>
    
    <xs:keyref name='categories-keyref' refer='f:categories-key' >
     <xs:selector xpath='f:feed'/>
     <xs:field xpath='@category'/>
    </xs:keyref>
  
  </xs:element>
</xs:schema>

On startup, the first thing the application tries to do is find a file named feeds.xml in the current directory. If the file is found, then it is loaded and validated against the above schema to ensure it is a valid feed subscription list. Similar validation occurs if an attempt is made to import a feed subscription list during the execution of the RSS Bandit application.
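A minimal sketch of this kind of validation step is shown below. It is not the RSS Bandit code: the FeedListValidator class and its Validate method are hypothetical names, and the sketch assumes the XmlSchemaSet and XmlReaderSettings APIs from later versions of the .NET Framework rather than the version 1.0 classes the application targets. The schema and the document are passed in as strings to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class FeedListValidator {
    // Validates xml against the schema text in xsd and returns any error messages.
    // An empty list means the document is valid according to the schema.
    public static List<string> Validate(string xsd, string xml) {
        List<string> errors = new List<string>();

        XmlSchemaSet schemas = new XmlSchemaSet();
        // Passing null uses the targetNamespace declared in the schema itself.
        schemas.Add(null, XmlReader.Create(new StringReader(xsd)));

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas = schemas;
        settings.ValidationEventHandler += delegate(object sender, ValidationEventArgs e) {
            if (e.Severity == XmlSeverityType.Error) {
                errors.Add(e.Message);
            }
        };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings)) {
            while (reader.Read()) { }   // reading the whole document drives validation
        }
        return errors;
    }
}
```

Collecting the messages instead of throwing on the first one lets the caller decide whether to reject the file or report every problem at once. A document that is not well-formed XML still raises an XmlException from the reader.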

Most of the information in the schema is straightforward, although the refresh-rate, etag, and story elements could do with some clarification. The refresh-rate element describes how often an attempt should be made to download a feed, in milliseconds as the annotation in the schema states. The etag element contains information from the ETag header that was sent back from the Web server the last time the feed was downloaded. This information is used when performing HTTP conditional GET requests. The story element contains the link to the news item, which doubles as a unique identifier used for distinguishing read versus unread stories.
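To illustrate how a stored ETag feeds into a conditional GET, here is a minimal sketch. It is an assumption-laden example rather than the application's code: it uses the HttpRequestMessage type from newer versions of .NET, the ConditionalGet class and its methods are hypothetical names, and sending If-Modified-Since from the last-retrieved value is an assumption about how that element might be used. The sketch only builds the request and does not touch the network.

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class ConditionalGet {
    // Builds a GET request that asks the server to answer 304 Not Modified
    // when the feed has not changed since the stored ETag or retrieval time.
    public static HttpRequestMessage BuildRequest(string feedUrl, string etag, DateTime? lastRetrieved) {
        HttpRequestMessage request = new HttpRequestMessage(HttpMethod.Get, feedUrl);
        if (!string.IsNullOrEmpty(etag)) {
            // Send the ETag back exactly as the server supplied it.
            request.Headers.TryAddWithoutValidation("If-None-Match", etag);
        }
        if (lastRetrieved.HasValue) {
            request.Headers.IfModifiedSince =
                new DateTimeOffset(lastRetrieved.Value.ToUniversalTime());
        }
        return request;
    }

    // A 304 status means the previously downloaded copy is still current.
    public static bool IsUnchanged(HttpStatusCode status) {
        return status == HttpStatusCode.NotModified;
    }
}
```

When the server answers with 304, no feed body is sent, which is where the bandwidth saving for the news source comes from.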

The key constraint specifies that each category element within categories must be unique and can be referenced as a key by another element or attribute. The keyref constraint specifies that the category attribute of a feed must have the same value as one of the category elements under categories.

XML Serialization in RSS Bandit

The information in the feed subscription list has to be accessed and modified quite often during the execution of RSS Bandit, which tends to favor storing the information in native data structures instead of in an XML document. For this reason, upon successful validation the feed subscription list is converted to objects using the XML Serialization technology described in last month's column, XML Serialization in the .NET Framework.

Below is the class that maps to the anonymous complex type that acts as the type definition of the feed element:

   using System.Collections; // needed for ArrayList

   /// <remarks/>
   [System.Xml.Serialization.XmlTypeAttribute(
      Namespace="http://www.25hoursaday.com/2003/RSSBandit/feeds/")]
   public class feedsFeed {
    
      /// <remarks/>
      public string title;
    
      /// <remarks/>
      [System.Xml.Serialization.XmlElementAttribute(DataType="anyURI")]
      public string link;
    
      /// <remarks/>
      [System.Xml.Serialization.XmlElementAttribute("refresh-rate")]
      public int refreshrate;
    
      /// <remarks/>
      [System.Xml.Serialization.XmlIgnoreAttribute()]
      public bool refreshrateSpecified;
    
      /// <remarks/>
      [System.Xml.Serialization.XmlElementAttribute("last-retrieved")]
      public System.DateTime lastretrieved;
    
      /// <remarks/>
      [System.Xml.Serialization.XmlIgnoreAttribute()]
      public bool lastretrievedSpecified;
    
      /// <remarks/>
      public string etag;
    
      /// <remarks/>
      [System.Xml.Serialization.XmlIgnoreAttribute()]
      public bool containsNewMessages; 

      /// <remarks/>
      [System.Xml.Serialization.XmlArrayAttribute(ElementName = 
      "stories-recently-viewed", IsNullable = false)]
      [System.Xml.Serialization.XmlArrayItemAttribute("story", Type = 
      typeof(System.String), IsNullable = false)]
      public ArrayList storiesrecentlyviewed;

      /// <remarks/>
      [System.Xml.Serialization.XmlAttributeAttribute("category")]
      public string category;
   }

The two attributes annotating the storiesrecentlyviewed field are the most interesting annotations in the class. The annotations basically state that the element named stories-recently-viewed maps to an ArrayList while its child story elements map to strings stored in the ArrayList.
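The mapping described above can be tried out with a cut-down class. The sketch below is illustrative only: FeedSketch, FromXml, and ToXml are hypothetical names, most fields of the real class are omitted, and the feed element is treated as the document root for brevity, whereas in the real subscription list it sits inside a feeds element.

```csharp
using System;
using System.Collections;
using System.IO;
using System.Xml.Serialization;

// Hypothetical cut-down version of the feedsFeed class, for illustration only.
[XmlType(Namespace = "http://www.25hoursaday.com/2003/RSSBandit/feeds/")]
[XmlRoot("feed", Namespace = "http://www.25hoursaday.com/2003/RSSBandit/feeds/")]
public class FeedSketch {
    public string title;

    [XmlElement(DataType = "anyURI")]
    public string link;

    // <stories-recently-viewed> becomes the ArrayList;
    // each <story> child becomes one string stored in it.
    [XmlArray("stories-recently-viewed", IsNullable = false)]
    [XmlArrayItem("story", typeof(string), IsNullable = false)]
    public ArrayList storiesrecentlyviewed;

    // XML text to object.
    public static FeedSketch FromXml(string xml) {
        XmlSerializer serializer = new XmlSerializer(typeof(FeedSketch));
        using (StringReader reader = new StringReader(xml)) {
            return (FeedSketch)serializer.Deserialize(reader);
        }
    }

    // Object back to XML text.
    public string ToXml() {
        XmlSerializer serializer = new XmlSerializer(typeof(FeedSketch));
        using (StringWriter writer = new StringWriter()) {
            serializer.Serialize(writer, this);
            return writer.ToString();
        }
    }
}
```

Marking a story as read then amounts to adding its link to the storiesrecentlyviewed list, and ToXml writes the updated state back out in the same element structure.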

XPath and the DOM in RSS Bandit

As mentioned earlier, an RSS feed contains one or more item elements that optionally have title, link, or description elements as children. However, how to locate these elements differs depending on which version of RSS an application is processing. In RSS 0.91, the item element is a child of the channel element and neither element has a namespace name. In RSS 1.0, the item element is part of the "http://purl.org/rss/1.0/" namespace and it is a child of the RDF element. The RDF element itself belongs to the "http://www.w3.org/1999/02/22-rdf-syntax-ns#" namespace. In RSS 2.0, the item element is a child of the channel element, and the two elements either both have no namespace name or both share the same one.

I decided to abstract away from the aforementioned differences in the various flavors of RSS and create a class that represented the typical information in an RSS item. During processing of RSS feeds, the RssHandler class retrieves the item elements from an RSS feed and converts them to RssItem objects. Below is a code fragment showing how to locate all the item elements from an RSS feed stored in an XmlDocument regardless of which version of RSS is being processed.

// Assumes feed is an XmlDocument holding the downloaded RSS feed and
// items is the list that collects the resulting RssItem objects.
string rssNamespaceUri = String.Empty;

if(feed.DocumentElement.LocalName.Equals("RDF") &&
   feed.DocumentElement.NamespaceURI.Equals(
      "http://www.w3.org/1999/02/22-rdf-syntax-ns#")){

   // RSS 1.0: item elements live in the RSS 1.0 namespace
   rssNamespaceUri = "http://purl.org/rss/1.0/";

}else if(feed.DocumentElement.LocalName.Equals("rss")){

   // RSS 0.91 and 2.0: item elements share the root element's namespace, if any
   rssNamespaceUri = feed.DocumentElement.NamespaceURI;

}else{
   throw new ApplicationException(
      "This XML document does not look like an RSS feed");
}

//convert RSS items in feed to RssItem objects and add to list
XmlNamespaceManager nsMgr = new XmlNamespaceManager(feed.NameTable);
nsMgr.AddNamespace("rss", rssNamespaceUri);

foreach(XmlNode node in feed.SelectNodes("//rss:item", nsMgr)){
   RssItem item = MakeRssItem((XmlElement)node);
   items.Add(item);
}//foreach

The above code fragment loops through each item element in the XML document independent of whether the RSS version being processed is 0.91, 1.0, or 2.0. Similar code is used to process the child elements of each item element when converting it to an RssItem object. The application code takes advantage of the fact that although RSS feeds from different versions are structured differently, the item elements in them are similar (that is, there are shared islands of structure).
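The conversion of an item element's children is not shown in the article, so the following is only a guess at its general shape. RssItemSketch, ChildText, FromElement, and ParseFeed are hypothetical names, the namespace detection is a simplified version of the fragment above (it does not check the RDF namespace or reject unknown root elements), and trimming the text is an assumption motivated by the whitespace around links in the sample feed.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical stand-in for the article's RssItem class.
public class RssItemSketch {
    public string Title = "";
    public string Link = "";
    public string Description = "";

    // Reads the text of one child element, or "" when the child is absent.
    private static string ChildText(XmlElement item, string name, XmlNamespaceManager nsMgr) {
        XmlNode child = item.SelectSingleNode("rss:" + name, nsMgr);
        return child == null ? "" : child.InnerText.Trim();
    }

    // Uses the same "rss" prefix for child lookups that was used to find the item.
    public static RssItemSketch FromElement(XmlElement item, XmlNamespaceManager nsMgr) {
        RssItemSketch result = new RssItemSketch();
        result.Title = ChildText(item, "title", nsMgr);
        result.Link = ChildText(item, "link", nsMgr);
        result.Description = ChildText(item, "description", nsMgr);
        return result;
    }

    // Parses a feed held in a string and converts every item element.
    public static List<RssItemSketch> ParseFeed(string xml) {
        XmlDocument feed = new XmlDocument();
        feed.LoadXml(xml);
        XmlElement root = feed.DocumentElement;

        // Simplified detection: RSS 1.0 uses a fixed namespace,
        // other versions reuse whatever namespace the root element has.
        string ns = root.LocalName == "RDF"
            ? "http://purl.org/rss/1.0/"
            : root.NamespaceURI;

        XmlNamespaceManager nsMgr = new XmlNamespaceManager(feed.NameTable);
        nsMgr.AddNamespace("rss", ns);

        List<RssItemSketch> items = new List<RssItemSketch>();
        foreach (XmlNode node in feed.SelectNodes("//rss:item", nsMgr)) {
            items.Add(FromElement((XmlElement)node, nsMgr));
        }
        return items;
    }
}
```

Binding one prefix to whichever namespace the feed uses is what lets the same relative XPath expressions serve every RSS version.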

XSLT in RSS Bandit

Popular news aggregators including Aggie, AmphetaDesk, and NewsGator support importing and exporting RSS feed subscriptions using an XML format known as OPML. Because interoperability is always a good thing, I decided to support converting my feed subscription list format to OPML and vice versa.

Converting my feed subscription list to OPML turned out to be fairly straightforward because I just needed to write out some of the information from the objects generated from the feed subscription list as XML. The code is shown below:

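   // Assumes _feedsTable is a table whose values are feedsFeed objects.
   // Note: titles or links containing ', & or < are not escaped here.
   StringBuilder sb = new StringBuilder("<opml>\n<body>\n");

   if(_feedsTable != null){

      foreach(feedsFeed f in _feedsTable.Values){
         sb.AppendFormat("<outline title='{0}' xmlUrl='{1}' />\n",
                         f.title, f.link);
      }
   }
   sb.Append("</body>\n</opml>");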
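One caveat with building the OPML by string formatting is that a feed title or URL containing an apostrophe or an ampersand produces malformed XML. A possible alternative, sketched below under stated assumptions, lets an XmlWriter do the escaping. This is not the RSS Bandit code: OpmlExport and ToOpml are hypothetical names, the dictionary of titles to URLs stands in for _feedsTable, and XmlWriter.Create comes from later versions of the .NET Framework.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class OpmlExport {
    // feeds maps a feed title to its URL; a stand-in for the article's _feedsTable.
    public static string ToOpml(IDictionary<string, string> feeds) {
        StringWriter output = new StringWriter();
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;
        settings.Indent = true;

        using (XmlWriter writer = XmlWriter.Create(output, settings)) {
            writer.WriteStartElement("opml");
            writer.WriteStartElement("body");
            foreach (KeyValuePair<string, string> feed in feeds) {
                writer.WriteStartElement("outline");
                // WriteAttributeString escapes markup characters in the values.
                writer.WriteAttributeString("title", feed.Key);
                writer.WriteAttributeString("xmlUrl", feed.Value);
                writer.WriteEndElement();
            }
            writer.WriteEndElement();   // body
            writer.WriteEndElement();   // opml
        }
        return output.ToString();
    }
}
```

The output has the same opml/body/outline structure as the string-based version, so other aggregators should be able to read it in the same way.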

However, converting the OPML files to my feed subscription list format required a bit more work. I decided the best route would be to use technology explicitly designed for converting between XML formats, which is XSLT. Below is the stylesheet that converts OPML files to my feed subscription list format:

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" />
  <xsl:template match="/">
    <feeds xmlns="http://www.25hoursaday.com/2003/RSSBandit/feeds/">
      <xsl:for-each select="/opml/body/outline">
        <feed>
          <title>
            <xsl:value-of select="@title" />
          </title>
          <link>
            <xsl:choose>
              <xsl:when test="@xmlUrl">
                <xsl:value-of select="@xmlUrl" />
              </xsl:when>
              <xsl:when test="@xmlurl">
                <xsl:value-of select="@xmlurl" />
              </xsl:when>
              <xsl:otherwise>ERROR: No RSS Feed URL in OPML File</xsl:otherwise>
            </xsl:choose>
          </link>
        </feed>
      </xsl:for-each>
    </feeds>
  </xsl:template>
</xsl:stylesheet>

Once the OPML file is converted to the RSS Bandit feed subscription list format, it is merged with the internal representation of the feed subscription list processed at startup.
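Running a stylesheet like the one above from C# could look roughly like the sketch below. It is an assumption rather than the application's code: it uses XslCompiledTransform, which arrived in later versions of the .NET Framework than the one this article targets, and OpmlImport and Transform are hypothetical names. Both the stylesheet and the OPML document are passed in as strings so the example stands alone.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class OpmlImport {
    // Applies the stylesheet text to the OPML text and returns the transformed XML.
    public static string Transform(string stylesheet, string opml) {
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(stylesheet))) {
            xslt.Load(styleReader);
        }

        StringWriter output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(opml)))
        // OutputSettings carries over the xsl:output choices, such as indent="yes".
        using (XmlWriter writer = XmlWriter.Create(output, xslt.OutputSettings)) {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

The returned text is in the feed subscription list format, so it can go through the same schema validation and deserialization steps as a feeds.xml file loaded at startup.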

Future Plans for RSS Bandit

I currently use RSS Bandit on a daily basis and find it quite useful. Before publishing this article, I made the installer available on GotDotNet and it has been downloaded 1000 times. Given the positive feedback about RSS Bandit, I have created a GotDotNet Workspace for the project and will work with others to continue developing it. There are a number of features I'd like to add, such as caching RSS feeds on disk, providing feedback when an RSS feed is invalid, implementing RSS autodiscovery, embedding RSS Bandit in Internet Explorer, and automatically updating the application using either the Background Intelligent Transfer Service API or the .NET Application Updater Component. Developers who would like to work on further development of RSS Bandit can join the GotDotNet workspace.

Locating RSS Feeds

The RSS Bandit installer places a number of feed subscription lists in a subdirectory of the RSS Bandit application on installation. These lists contain feeds for technology news sites, XML-centric news sources, and developer Web logs. To find more RSS feeds, or to see whether your favorite Web sites offer syndication, good places to start are News Is Free and Syndic8.

Dare Obasanjo is a member of Microsoft's WebData team, which among other things develops the components within the System.Xml and System.Data namespaces of the .NET Framework, Microsoft XML Core Services (MSXML), and Microsoft Data Access Components (MDAC).

Feel free to post any questions or comments about this article on the Extreme XML message board on GotDotNet.