Add Custom Data Mining Algorithms to SQL Server 2005
SQL Server 2005 Analysis Services
Summary: Get a high-level overview of the benefits of the extensibility framework in SQL Server 2005 Analysis Services that allows independent software developers to easily integrate new data mining algorithms into the product. (7 printed pages)
Data mining in SQL Server 2005 is a platform that independent software developers can build on in many ways. One facet of the platform vision of the architects of SQL Server Data Mining is the ability of third parties to develop their own specialized data mining algorithms and plug them into the Analysis Services infrastructure in SQL Server 2005.
As an independent data mining algorithm developer, you not only have to design and implement the complex logic for building and navigating your models, you also need to worry about the ability to read raw data from various data sources, transform it into a format that is usable by the mining algorithm code, and finally present the results to the user in a form that they can comprehend. Note that we have not even talked about common enterprise requirements like deployment to multiple users, secure storage and access control, multi-user querying and programmability. This is where building on top of a platform like SQL Server 2005 Data Mining proves hugely advantageous.
By integrating at a very low level into the data-mining engine, you are freed from implementing:
- Data access modules for various kinds of data sources for training and prediction
- Tokenization of user data to and from the numeric attribute space that is typically used by most data mining algorithm implementations
- A definition and query language with an associated parser and execution engine for creating, training and accessing mining models
- Secure storage of mining model and related objects on disk
- Fine-grained authentication and authorization mechanisms for multi-user access to mining models
- A scalable training and querying engine
The benefits are not just limited to the resource savings from not having to develop, test, and support the infrastructure pieces described above—you automatically gain a slew of features simply by virtue of being a plug-in algorithm:
- A complete design and development environment, in the form of Business Intelligence Development Studio (BIDS), which includes tools for all phases of data mining model development: data exploration widgets, model creation wizards, designers, model visualizers, model scoring and comparison tools and a prediction query builder. BIDS allows developers and analysts to develop and refine mining models iteratively and then deploy them for multi-user access to a production server.
- A management environment in the form of SQL Server Management Studio. SSMS allows administrators to centrally manage storage and security for mining models using a familiar database-oriented paradigm.
- Deep integration with OLAP that provides features like the ability to directly mine OLAP cubes and to build data-mining dimensions
- Data-mining components in SQL Server 2005 Integration Services for complex data preparation and text mining
- The ability to reach the large community of database developers through familiar programming models like ADO and ADO.NET. Plug-in algorithms are also accessible at the same level as native SQL Server 2005 Data Mining algorithms through the new client and server-side ADOMD.NET programming models, as well as through AMO, the new object model for managing Analysis Services objects
- The ability to query mining models based on your plug-in algorithm via DMX, a powerful SQL-based language that also supports stored procedures and user-defined functions written in a .NET language of your choice
- The ability to easily produce and distribute data mining results through SQL Server 2005 Reporting Services
- The ability to allow users to access your mining models as a Web service, since Analysis Services 2005 is a native XMLA (XML for Analysis) server that can be accessed via TCP or HTTP
- The ability to plug your custom viewers for both native and plug-in algorithms into the SQL Server 2005 development and management environments.
- Interoperability with other data-mining products via PMML.
- Access to the large installed base of SQL Server licensees
Figure 1. Where Data Mining plug-ins fit into Analysis Services 2005
In the following sections, we will drill down into the various implementation pieces that make it possible for SQL Server Data Mining users to not only seamlessly build, train, browse, and deploy models based on your plug-in data mining algorithms, but also embed them into their custom applications.
The Data Mining Engine in SQL Server 2005 Analysis Services communicates with plug-in algorithms via a set of COM interfaces that are available in a public header file. These are grouped into two categories: interfaces implemented by plug-in algorithms and interfaces implemented by the Data Mining Engine and consumed by plug-in algorithms.
This means that a plug-in algorithm is implemented as a COM DLL that is loaded into the Analysis Server process.
The interaction between the Data Mining Engine and a plug-in algorithm via these COM interfaces falls into one of the five types:
- Metadata Query: Discovering the capabilities of the plug-in algorithm
- Training: Iterating a set of cases and extracting knowledge, rules, or patterns
- Content Browsing: Navigating the rules/patterns learned by the algorithm
- Prediction: Applying the rules/patterns learned by the algorithm to new input cases. This is commonly known as scoring.
- Persistence: Loading/saving learned content
All mining model and related storage objects live in on-disk databases managed by the Data Mining Engine (Analysis Server) storage engine subsystem. The Data Mining Engine allows you to store your algorithm-specific content as part of its corresponding mining model object using persistence interfaces that it makes available to you. This means that you simply write to and read from these streaming interfaces without needing to worry about developing your own transactional storage system. You are of course free to define and manage your own in-memory structures that you load from the Analysis Server storage via these interfaces.
Memory objects—such as strings, variants, and arrays—must be allocated and freed using per-algorithm or per-request service provider interfaces passed to plug-in algorithms by the Data Mining Engine. This allows the Data Mining Engine to efficiently manage memory resources and balance them across multiple requests. As a plug-in algorithm developer, this is another complex area that you no longer need to spend development effort on.
Error reporting is done through the standard COM mechanism of creating and populating errorinfo objects. The Data Mining Engine consumes your errorinfo objects and raises your errors in the same way that it reports errors from other external components. Like native errors, your errors may be logged to the console, administrator-specified server log files, or the Windows event log.
Your algorithm is registered like a standard COM DLL using the Windows regsvr32 utility. In addition, you need to edit the Analysis Server configuration file to add an entry for your algorithm and point it your COM server's ProgID. This allows Analysis Server administrators to control which plug-in algorithms can be loaded into the Analysis Server process.
In addition to server-side plug-in algorithms, the SQL Server Data Mining Plug-in Framework allows you to build custom viewers that plug into Business Intelligence Development Studio and into SQL Server Management Studio.
Plug-in viewers are Winforms UserControls that implement a simple mining viewer interface. They can be implemented in any .NET language.
Your plug-in viewer can be associated with one or more plug-in or native Microsoft algorithms via the Windows registry.
The deep, low-level integration of plug-in algorithms and viewers enables a uniform user experience at all levels of interaction with the SQL Server Data Mining platform. This means that each of the following classes of users sees no difference between a native (Microsoft-supplied) algorithm/viewer and a plug-in algorithm/viewer:
- Users of Business Intelligence Development Studio (analysts, developers)
- Users of SQL Management Studio (administrators/DBAs)
- Users of client- and server-side programming models (developers)
For example, a registered plug-in algorithm would simply show up in the list of algorithms displayed by the Data Mining Wizard as shown below:
Figure 2. List of data-mining algorithms
Once the user picks an algorithm from this list, the rest of the wizard follows the same path that is taken for a built-in algorithm.
The following resources are available to help you develop plug-in algorithms quickly:
- A technical white paper that describes the implementation architecture including interface descriptions, as well as control and data flow in detail
- Tutorials that step you through the process of creating your plug-in algorithm and viewer, including shell code that you can re-use
- Complete C++ source code for a sample plug-in algorithm implementation
Links to these resources are included in the References section below.
The plug-in algorithm architecture in SQL Server 2005 Data Mining enables you to invest your development resources in new, innovative mining algorithms instead of infrastructure work. By plugging into SQL Server 2005 Data Mining, you can take your single-user, niche product to a new market—the enterprise—and share in the growth of the Microsoft database and business intelligence platform.
SQL Server Data Mining: Plug-In Algorithms by Raman Iyer and Bogdan Crivat
A Tutorial for Constructing a Plug-In Algorithm by Max Chickering and Raman Iyer (on SQLServerDataMining.com—requires user login)
A Tutorial for Constructing a Plug-In Viewer by Jesper Lind and Scott Oveson (on SQLServerDataMining.com—requires user login)