A Tutorial for Constructing a Plug-in Algorithm

 

Max Chickering and Raman Iyer
Microsoft Corporation

August 2004

Applies To:
   Microsoft SQL Server 2005 Analysis Services
   Microsoft Visual C++

Summary: Learn how to create an Analysis Services 2005 plug-in algorithm. This tutorial steps through the process of implementing a plug-in algorithm and integrating that algorithm into Analysis Services. It also provides stub code to enable algorithm developers to quickly integrate a "shell" plug-in algorithm into Analysis Services. (45 pages)

Contents

Overview
The Big Picture
The Model: Pair-wise Linear Regression
Important Supplied Files
Important Preliminaries
Building a Shell Plug-in Algorithm
Customizing the Algorithm: Pair-wise Linear Regression
Using the Customized Plug-in Algorithm
Making Predictions
Conclusion

Overview

Since Service Pack 1 of Microsoft SQL Server 2000 Analysis Services, Analysis Services has allowed the integration of third-party OLE DB for Data Mining providers: third-party developers can create their own data mining algorithms to be used directly by Analysis Services. We call any such algorithm a plug-in algorithm.

In SQL Server 2005 Beta 1 Analysis Services, integration of plug-in algorithms has been simplified significantly. In particular, you only need to implement a handful of COM interfaces to incorporate a plug-in algorithm into Analysis Services.

This tutorial steps you through the process of implementing a plug-in algorithm and integrating that algorithm into Analysis Services. With the stub code that is provided with this tutorial, you should be able to have a "shell" plug-in algorithm integrated into Analysis Services within an hour. This means that developers of data mining algorithms can concentrate their efforts on their algorithms, as opposed to worrying about integration with Analysis Services.

This tutorial does not document all of the details of the interfaces, but rather provides sufficient information for most algorithm developers. Those interested in more details about plug-in algorithms and OLE DB for Data Mining should refer to the following documents:

There are a number of files that accompany this tutorial, and we assume that these are located in your file system in a location that we will refer to as SRC. For example, if you installed these files in the directory C:\temp, then SRC will refer to that directory. There should be six sub-directories of the SRC directory: CompletedDemo, CompletedShell, CustomCode, Demo, StubCode, and Utilities. We assume that all of the source code needed to compile your project resides in a single directory that we will refer to as PRJ.

This document is organized as follows. In Section 2, we provide a high-level overview of (a) the interfaces that need to be implemented by a plug-in algorithm and (b) the architecture that we chose for the implementation in this tutorial.

In Section 3, we describe the pair-wise linear regression model for which we will be constructing the plug-in algorithm; this is a non-trivial model that has proven to be useful for analyzing multiple real-world problems.

In Section 4, we describe the set of C++ files that comes with this tutorial and that will be generally useful for writing plug-in algorithms.

In Section 5, we discuss some important issues about plug-in algorithms that should be considered before we begin the tutorial.

In Section 6, we guide you through the steps of building (starting with nothing but some stub code) a "shell" plug-in algorithm that integrates into the Analysis Services framework. Although the resulting shell algorithm does not do anything useful, it is a good starting point for further development. It should take about an hour to work through Section 6. For those who wish to skip this step, the final set of code is provided with this tutorial in the directory SRC\CompletedShell\.

In Section 7, we start from the shell plug-in algorithm from Section 6, and using code provided from the SRC\CustomCode\ directory, we implement (a) a learning algorithm, (b) a prediction method, and (c) a browsing method for the pair-wise linear regression model described in Section 3.

Finally, in Section 8, we demonstrate how to use the new algorithm within the Business Intelligence Development Studio application.

The Big Picture

Your plug-in algorithm needs to provide an implementation for five main COM interfaces:

  1. IDMAlgorithm: The algorithm interface both (a) implements a model-producing learning algorithm and (b) implements the prediction operations of the resulting model.
  2. IDMAlgorithmNavigation: The algorithm-navigation interface allows browsers to access the content of your model.
  3. IDMPersist: The persist interface allows the models learned by your algorithm to be saved and loaded by Analysis Services.
  4. IDMAlgorithmMetadata: The algorithm-metadata interface describes the capabilities and input parameters of your learning algorithm.
  5. IDMAlgorithmFactory: The algorithm-factory interface both (a) creates instances of the objects that implement the algorithm interface and (b) provides Analysis Services access to the algorithm-metadata interface.

Analysis Services uses these interfaces to learn a model from data as follows. At startup, Analysis Services uses an initialization file (called msmdsrv.ini) to determine what data mining algorithms are available. This file includes a list of all the Microsoft algorithms that ship with Analysis Services. After you build a DLL that implements your plug-in algorithm, you will add your algorithm to the list and specify the ProgID for the CoClass that implements the algorithm-factory interface. When faced with a data mining task, Analysis Services will create an instance of your factory object and use the algorithm-metadata interface (obtained through the algorithm-factory interface) to determine whether your algorithm is appropriate for the given task. Next, Analysis Services creates an instance of your algorithm using the algorithm-factory interface. The resulting algorithm interface is used to train the model and to provide access to the persist interface and the algorithm-navigation interface.

In Figure 1, we show the three C++ classes that we will be using in this tutorial to implement these interfaces. The FACTORY class implements both the algorithm-factory interface and the algorithm-metadata interface. This is the CoClass whose ProgID is provided in the initialization file of Analysis Services. The ALGORITHM class implements both the algorithm interface and the persist interface. This class also has static-member functions that implement the algorithm-metadata interface; the FACTORY implementation of IDMAlgorithmMetadata simply calls the corresponding static-member functions of ALGORITHM. We chose this architecture so that the capabilities of the mining algorithm could be "published" by the same class that implements that algorithm. Finally, the NAVIGATOR class implements the algorithm-navigation interface. Each instance of this class holds a pointer to the ALGORITHM object for which it is a navigator.


Figure 1. C++ classes implementing the interfaces

The algorithm interface is responsible for supplying both the persist interface and the algorithm-navigation interface to Analysis Services upon request; when Analysis Services requests an algorithm-navigation interface, the ALGORITHM C++ class will create new instances of the NAVIGATOR C++ class and return the corresponding interface.

The Model: Pair-wise Linear Regression

In this tutorial, we will construct a plug-in algorithm that learns, for each output attribute, a pair-wise regression model for every input attribute. For example, suppose that our domain consists of three continuous attributes X, Y, and Z, each of which is both an input and an output. In this case, our algorithm will construct the following six regression models:

X = m1·Y + b1        Y = m3·X + b3        Z = m5·X + b5
X = m2·Z + b2        Y = m4·Z + b4        Z = m6·Y + b6

Figure 2. Pair-wise regression models for attributes X, Y, and Z

That is, the algorithm will use the training data to learn a regression for each pair of variables: X as a function of Y; X as a function of Z; Y as a function of X; and so on. In practice, this means the algorithm estimates all of the mi and bi values, as well as the standard deviation for each regression.
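Each pair-wise fit has a closed-form solution from simple sufficient statistics. The sketch below is not the tutorial's actual training code (which must consume cases through IDMCaseProcessor); it only illustrates, under the assumption of ordinary least squares, how one mi, bi, and residual standard deviation could be computed:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Closed-form least-squares fit of x = m*y + b, plus the residual
// standard deviation used as the regression's Gaussian noise.
struct PairwiseRegression
{
    double m, b, stddev;
};

PairwiseRegression FitPair(const std::vector<double>& x,
                           const std::vector<double>& y)
{
    const std::size_t n = x.size();
    double sumX = 0, sumY = 0, sumXY = 0, sumYY = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        sumX += x[i]; sumY += y[i];
        sumXY += x[i] * y[i]; sumYY += y[i] * y[i];
    }
    const double meanX = sumX / n, meanY = sumY / n;
    const double covXY = sumXY / n - meanX * meanY;   // cov(X, Y)
    const double varY  = sumYY / n - meanY * meanY;   // var(Y)

    PairwiseRegression r;
    r.m = covXY / varY;            // slope
    r.b = meanX - r.m * meanY;     // intercept

    double ss = 0;                 // residual sum of squares
    for (std::size_t i = 0; i < n; ++i)
    {
        const double e = x[i] - (r.m * y[i] + r.b);
        ss += e * e;
    }
    r.stddev = std::sqrt(ss / n);
    return r;
}
```

In a real plug-in, these statistics would be accumulated incrementally inside InsertCases() as training cases stream through, rather than from in-memory vectors.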

To view our model, we will use the regression-tree viewer that is provided by Analysis Services. This viewer is intended to display decision trees in which the leaves contain linear regressions; we will construct our algorithm-navigation interface so that our model resembles a set of decision trees. In particular, for each output variable we will have a decision tree—containing no splits—in which the single linear regression formula is constructed by summing together all of the mi terms in the corresponding pair-wise models. For example, the tree corresponding to output attribute X from above would be a root node containing the following regression:


Figure 3. Regression from the decision tree for attribute X (using linear coefficients)

where X̄ denotes the sample average of X, and the standard deviation of the regression is set to the (marginal) sample standard deviation of X. We use the average value of X as the offset because we have a set of bi values but only one offset to display in the viewer. The coefficients m1 and m2 from the two pair-wise models will typically be different from the coefficients that would result by learning the three-variable linear regression above. The following is what the browser will display using some sample data provided with this tutorial after we learn our model:


Figure 4. Browser display for attribute X using linear coefficients

We provide a flag to the learning algorithm that allows the browser to display the correlation coefficients, instead of the linear coefficients mi. The correlation coefficients are often more useful to an analyst than the linear coefficients. The correlation coefficient rXY between X and Y can be computed from the regression X = m1 Y + b1 by simply multiplying m1 by the ratio of (a) the standard deviation of Y and (b) the standard deviation of X:

rXY = m1 × (σY / σX)

Figure 5. Formula for the correlation coefficient between X and Y
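The identity in Figure 5 is easy to verify numerically: the Pearson correlation computed directly from the data equals the regression slope rescaled by the ratio of standard deviations. A small self-contained check (illustrative helper functions, not the tutorial's code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

double Mean(const std::vector<double>& v)
{
    double s = 0;
    for (double d : v) s += d;
    return s / v.size();
}

// Population standard deviation.
double Sd(const std::vector<double>& v)
{
    const double mu = Mean(v);
    double s = 0;
    for (double d : v) s += (d - mu) * (d - mu);
    return std::sqrt(s / v.size());
}

// Slope m of the least-squares regression x = m*y + b.
double SlopeXonY(const std::vector<double>& x, const std::vector<double>& y)
{
    const double mx = Mean(x), my = Mean(y);
    double cov = 0, varY = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        cov  += (x[i] - mx) * (y[i] - my);
        varY += (y[i] - my) * (y[i] - my);
    }
    return cov / varY;
}

// Pearson correlation computed directly from the data.
double PearsonR(const std::vector<double>& x, const std::vector<double>& y)
{
    const double mx = Mean(x), my = Mean(y);
    double cov = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        cov += (x[i] - mx) * (y[i] - my);
    cov /= x.size();
    return cov / (Sd(x) * Sd(y));
}
```

For any data set, SlopeXonY(x, y) * Sd(y) / Sd(x) and PearsonR(x, y) agree up to rounding, which is exactly the conversion the plug-in applies when the correlation-coefficient flag is set.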

Thus, when this flag is set, we display the regression:


Figure 6. Regression from the decision tree for attribute X (using correlation coefficients)

Although the correlation coefficients are a simple function of the linear coefficients, we need to rebuild the model in order to display them, because the built-in browser provides no way for us to pass in this information. After we specify that we would like to view correlations, we get the following display:


Figure 7. Browser display for attribute X using correlation coefficients

Using our model, we make predictions by applying naive-Bayesian inference. To predict variable X, given values of Y and Z, we use:

p(X | Y, Z) ∝ p(X) · p(Y | X) · p(Z | X)

Figure 8. Applying the naive-Bayesian inference to predict X

where p(X) is Gaussian and the two conditionals are the Gaussians obtained from our pair-wise linear models. It turns out that this posterior distribution is also a Gaussian, and for our example the posterior standard deviation and the posterior mean are given as follows:


Figure 9. Formulas for the posterior standard deviation and posterior mean
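These posterior formulas follow from standard linear-Gaussian algebra: precisions add, and the posterior mean is the precision-weighted combination of the prior and the evidence from each pair-wise regression. The sketch below is our own restatement of that derivation with illustrative names (mY, bY, sY are the slope, intercept, and residual standard deviation of the model for Y given X, and likewise for Z); it is not the tutorial's prediction code:

```cpp
#include <cassert>
#include <cmath>

// Posterior over X given Y=y and Z=z under the naive-Bayes factorization
// p(X|y,z) proportional to p(X) * p(y|X) * p(z|X), where p(X) = N(mu0, s0^2)
// and each conditional is the Gaussian of a pair-wise regression, e.g.
// p(y|X) = N(mY*X + bY, sY^2).
struct Gaussian { double mean, stddev; };

Gaussian PosteriorX(double mu0, double s0,                  // prior on X
                    double mY, double bY, double sY, double y,
                    double mZ, double bZ, double sZ, double z)
{
    // Precisions add across the prior and the two evidence terms.
    const double tau = 1.0 / (s0 * s0)
                     + (mY * mY) / (sY * sY)
                     + (mZ * mZ) / (sZ * sZ);
    // Precision-weighted mean.
    const double num = mu0 / (s0 * s0)
                     + mY * (y - bY) / (sY * sY)
                     + mZ * (z - bZ) / (sZ * sZ);
    Gaussian g;
    g.mean = num / tau;
    g.stddev = 1.0 / std::sqrt(tau);
    return g;
}
```

As a sanity check, if both slopes are zero (Y and Z carry no information about X), the posterior reduces to the prior, as it should.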

Using the same model as above, we can predict the expected X for each case in the training data, given the values of Y and Z. The following is the output of this prediction task, showing the real values vs. the predicted values for the first three cases:

Table 1. Real values and predicted values of attribute X for cases 1–3

Case    Real X      Predicted X
1       50.08256    45.46138
2       30.40568    34.39197
3       -6.89417    -9.25408

Important Supplied Files

There are a number of files provided with this tutorial that will be useful not only for this tutorial, but for implementing other plug-in algorithms in the future.

Perhaps the most important supplied files are those located in the SRC\StubCode directory. These files contain the stub code for all of the interfaces; this code, which provides detailed comments for all of the functions, can be pasted easily into a new project to allow fast implementation of your algorithm. The names of these files indicate the name of the interface for which they provide the stub code: thus IDMAlgorithm.Stubs.h and IDMAlgorithm.Stubs.cpp contain, respectively, the stub declaration and implementation code for the IDMAlgorithm interface.

There are two sets of stub code for the IDMAlgorithmMetadata interface: one set for implementing this interface as static-member functions of the ALGORITHM C++ class (which is what we do in this tutorial), and another set for implementing this interface in the FACTORY C++ class. For this interface, the filenames are decorated with "StaticStubs" and "FactoryStubs" instead of "Stubs" as described above. If you use the static-member implementation, the file StaticAlgorithmMetadata.h contains the macro IMPLEMENT_STATIC_ALGORITHM_METADATA that you include in the FACTORY class declaration. This macro implements functions from the IDMAlgorithmMetadata interface in the FACTORY class by calling the corresponding static functions in the ALGORITHM class.

Assuming that you are using the static implementation of the IDMAlgorithmMetadata interface, the files SRC\Utilities\ParamHandler.cpp and SRC\Utilities\ParamHandler.h define a class that automatically implements all of the parameter-handling functions; using this class simplifies development by taking care of the tedious parsing of strings that is necessary for implementing these functions.

The files SRC\Utilities\caseprocessor.h and SRC\Utilities\caseprocessor.cpp contain the declaration and implementation of the class CASEPROCESSOR. This class implements the IDMCaseProcessor interface, which is the method by which Analysis Services provides data to your algorithm. To use the CASEPROCESSOR class, you implement a derived class that implements either the function ProcessCaseSparse() or the function ProcessCaseDense(). Regardless of how the data is stored within Analysis Services, CASEPROCESSOR provides a view of each case that is either sparse (that is, a set of attribute/value pairs) or dense (that is, an array of values in one-to-one correspondence with the zero-indexed attributes), depending on what is most convenient for your algorithm.
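The actual CASEPROCESSOR class is defined in the supplied files; purely to illustrate the sparse-versus-dense distinction it abstracts over, here is a stand-alone sketch (the function and types below are illustrative, not part of the tutorial's interfaces):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A sparse case is a set of (attribute index, value) pairs; a dense case
// is an array of values in one-to-one correspondence with the zero-indexed
// attributes. Converting sparse -> dense fills unmentioned attributes with
// a default value (0.0 here).
std::vector<double> ToDense(
    const std::vector<std::pair<std::size_t, double>>& sparseCase,
    std::size_t cAttributes)
{
    std::vector<double> dense(cAttributes, 0.0);
    for (const auto& av : sparseCase)
        dense[av.first] = av.second;
    return dense;
}
```

CASEPROCESSOR performs this kind of adaptation for you in both directions, so your derived class only implements whichever of ProcessCaseSparse() or ProcessCaseDense() is more natural for your algorithm.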

An STL allocator class that can be used with Analysis Services' memory-management model is provided in SRC\Utilities\DmhAllocator.h, and a corresponding template class that derives from the STL vector class is provided in SRC\Utilities\DmhVector.h. The Dmh prefix to the file is short for "data mining helper"; there are a number of Dmh-prefixed files in SRC\Utilities\ that contain helper classes that will be generally useful for implementing any plug-in algorithm.

Important Preliminaries

Dmalgo.h

The main header file describing the interfaces and data types used by plug-in algorithms is dmalgo.h, which can be found in the SRC directory. It may be useful to skim through the definitions of the five main interfaces so that you have some basic familiarity with the types of functions these interfaces support. When implementing other plug-in algorithms, you will probably refer to this header frequently.

COM

We will be using the Active Template Library (ATL) to implement the COM interfaces in this tutorial. Although not essential, it would be useful to have a fairly good understanding of how COM works and at least some familiarity with ATL. For those interested in references on these topics, we recommend Essential COM by Don Box and ATL COM Programmer's Reference by Richard Grimes.

Threading

Your plug-in algorithm and Analysis Services will be communicating with each other through COM interfaces. To avoid marshalling these interface calls, which will significantly slow down model training, you should use the same threading model as the interfaces exposed by Analysis Services.

Analysis Services uses the free-threaded model, and thus it is important for you to use this model as well. You are guaranteed, however, that the two main algorithm-training functions, IDMAlgorithm::Initialize() and IDMAlgorithm::InsertCases(), will never be called by multiple threads simultaneously. In other words, you need not engineer your model-training code to be re-entrant, despite the fact that you are using the free-threaded model.

In general, plug-in algorithms should avoid creating new threads, because the server has its own thread-management and pooling mechanism. Windows threads created in the server process, outside the control of the server's thread pool, will adversely impact the server's ability to manage threads.

Memory Management

All memory allocations by a plug-in algorithm must be made using the memory-management interfaces supplied by Analysis Services. This takes some getting used to, but it is essential for properly integrating with Analysis Services.

One exception to this rule is that you need to call ::SysAllocString() to allocate space for BSTR strings in a VARIANT.

The memory-management interface (IDMMemoryAllocator) that you use at any particular time depends on the lifetime of the memory that you will be allocating. The two choices are to allocate memory that lasts (a) for the length of a function call or (b) for the lifetime of a model. Almost every function call defined by the plug-in-algorithm interfaces takes, as input, a "context-services interface" (IDMContextServices). From this interface, you can obtain a memory-allocator interface to be used to allocate memory that will only last the lifetime of the function call. When your model is created by the IDMAlgorithmFactory interface, you will have access to a "model-services interface" (IDMModelServices) that can be cached away with your model. This interface can be queried for a memory-management interface to be used for allocating memory that can last as long as your model.

For this tutorial, we provide the helper class DMHALLOC, which contains an ATL smart pointer to the memory-management interface. The main benefit of this class is that you can initialize it either with a context-services interface, with a model-services interface, with a memory-management interface, or with another DMHALLOC. The ALGORITHM class shown in Section 2 derives from this class. In the header file SRC\Utilities\DmhMemory.h, where DMHALLOC is defined, we also provide placement new and placement delete operators that use the DMHALLOC class. You can also create COM CoClass objects that derive from DMHALLOC by using the CComObjectDM template in DmhMemory.h.
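The placement-operator pattern that DmhMemory.h provides looks roughly like the following sketch. Note that StubAllocator here is a trivial stand-in for DMHALLOC (the real class forwards to the IDMMemoryAllocator interface), so this compiles on its own; consult DmhMemory.h for the actual operators:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Stand-in for DMHALLOC: in the real helper, Alloc/Free forward to the
// memory-management interface obtained from Analysis Services.
struct StubAllocator
{
    void* Alloc(std::size_t cb) { return std::malloc(cb); }
    void  Free(void* pv)        { std::free(pv); }
};

// Placement new/delete that route allocations through the allocator,
// mirroring the pattern the tutorial's helper provides for DMHALLOC.
void* operator new(std::size_t cb, StubAllocator& alloc) { return alloc.Alloc(cb); }
void  operator delete(void* pv, StubAllocator& alloc)    { alloc.Free(pv); }

struct Node { double value; };
```

Usage then reads naturally: `Node* p = new (alloc) Node{3.0};`, with the object destroyed explicitly and its memory returned through the same allocator.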

STL

In this tutorial, we make frequent use of the STL vector template class. Because of the memory-management requirements discussed in Section 5.4, the tutorial provides a derived template class dmh_vector that uses a special STL allocator (dmh_allocator). This allocator uses a DMHALLOC class to allocate memory. We require a DMHALLOC class upon construction of a dmh_vector, which means that any class that contains an STL vector as a member variable must have access to a DMHALLOC upon construction.

Error Handling

Plug-in algorithms should pass errors back to the server both by returning HRESULTs and by raising errors via the global function ::SetLastError (or indirectly through ::AtlReportError). Plug-in algorithms should not throw exceptions that they do not handle themselves. Because some STL functions throw when they cannot allocate enough memory, we need to wrap any such function call within a try/catch block to make sure the exception is not passed back to the server. We accomplish this in the demo using the macro CHECK_STL_MEM (see DmhVector.h).
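The wrapping macro can be as simple as a try/catch that maps std::bad_alloc to an HRESULT. The sketch below conveys the idea (the actual CHECK_STL_MEM in DmhVector.h may differ, and the HRESULT definitions are minimal stand-ins for the Windows headers so the sketch is self-contained):

```cpp
#include <cassert>
#include <new>
#include <vector>

// Minimal stand-ins for the Windows definitions.
typedef long HRESULT;
const HRESULT S_OK = 0;
const HRESULT E_OUTOFMEMORY = (HRESULT)0x8007000EL;

// Execute an STL expression; convert an allocation failure into an
// HRESULT return instead of letting the exception escape to the server.
#define CHECK_STL_MEM_SKETCH(expr)                           \
    try { (expr); }                                          \
    catch (const std::bad_alloc&) { return E_OUTOFMEMORY; }

HRESULT AppendValue(std::vector<double>& v, double d)
{
    CHECK_STL_MEM_SKETCH(v.push_back(d));   // push_back may throw
    return S_OK;
}
```

Every interface method that touches the STL gets this treatment, so the only things that ever cross the COM boundary are HRESULTs.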

In the current SQL Server 2005 beta release, errors raised using ::SetLastError are not reported by Analysis Services, and consequently we only report errors using HRESULTs. However, in anticipation of adding better error handling to the demo, we implement the ISupportErrorInfo interface on our main objects.

Building a Shell Plug-in Algorithm

In this section, we step you through the process of creating a plug-in algorithm that integrates into Analysis Services. We assume that you are using Microsoft Visual Studio version 7.0 or later.

This section will show you in detail all the steps required to build a plug-in shell from scratch; following along with each step should help increase your understanding of the Analysis Services architecture. If you decide to skip this section, the code that results is provided in the SRC\CompletedShell directory.

Step 1: Creating the project

  1. Open Visual Studio, and on the File menu select New, and then select Project.

  2. Open the Visual C++ Projects folder and select ATL Project from the ATL folder. Select an appropriate location, which we will refer to as PROJ, and call the project "PlugIn". Click OK.

  3. Select the Applications Settings tab and clear the Attributed check box. (This step is not necessary, but it allows compatibility with earlier versions of Visual Studio.) Leave the Server Type as DLL, and check Allow merging of proxy/stub code.

    The dialog should look like this:

    ms345112.pluginalg210(en-US,SQL.90).gif

    Figure 10. Configuring the Application Settings for the project "PlugIn"

  4. Click Finish to create the project.

  5. Copy the following files from the SRC\Utilities\ directory into the location of your project, and then add them to your project by selecting Add Existing Item on the Project menu:

    • DmhMemory.h and DmhMemory.cpp

      These files declare and implement the memory-management helper class DMHALLOC and the corresponding placement new and placement delete operations.

    • DmhLocalization.h and DmhLocalization.cpp

These files simply declare and implement the global function ::LoadStringFromID(), which allows your plug-in algorithm to retrieve localizable strings. The default implementation simply returns the string version of the ID; it is up to you to re-implement it as appropriate. For example, if your algorithm will be localized for different markets, you will probably want this function to load strings from a resource file.

    • ParamHandler.h and ParamHandler.cpp

      These files provide a free implementation of the messy parameter-handling functions from IDMAlgorithmMetadata.

  6. Copy the file StaticAlgorithmMetadata.h from SRC\StubCode\ to the main project directory PROJ. You need this file if you are going to be implementing the algorithm-metadata interface using static-member functions on the algorithm class. We recommend this option for two reasons. First, it keeps the functions describing the algorithm in the same location as the algorithm itself. Second, using this approach you can get free implementations of the messiest IDMAlgorithmMetadata functions.

  7. Copy the files dmalgo.h and oledbdm.h from the top-level SRC directory to the main project directory.

Step 2: Creating the ALGORITHM class

Now we will create the ALGORITHM class to implement both the IDMAlgorithm and the IDMPersist interfaces.

  1. Select Add Class on the Project menu.

  2. Expand the Visual C++ folder, click on the ATL folder, select ATL Simple Object from the Templates panel, and click Open.

  3. Type in the name of your algorithm in the Short name edit box. We used the short name ALGORITHM, and then removed the "C" from the Class name so that the class name is also ALGORITHM.

  4. Change the interface from IALGORITHM to IDMAlgorithm.

  5. Click the Options tab. Choose Free for the threading model, choose Custom for the interface, and select the ISupportErrorInfo check box.

  6. Click Finish.

  7. In the resulting header file, which we assume is named ALGORITHM.h, include the following headers directly beneath resource.h:

    #include "DmhMemory.h"
    #include "oledbdm.h"
    #include "DmhLocalization.h"
    
  8. Also in ALGORITHM.h, specify that the ALGORITHM class inherits from IDMPersist and DMHALLOC:

    class ATL_NO_VTABLE ALGORITHM :
        public CComObjectRootEx<CComMultiThreadModel>,
        public CComCoClass<ALGORITHM, &CLSID_ALGORITHM>,
        public ISupportErrorInfo,
        public IDMAlgorithm,
        public IDMPersist,
        public DMHALLOC
    
  9. Add the following between BEGIN_COM_MAP and END_COM_MAP:

    COM_INTERFACE_ENTRY(IDMPersist)
    
  10. In ALGORITHM.h, in the class definition of ALGORITHM, paste in the contents of both IDMAlgorithm.Stubs.h and IDMPersist.Stubs.h (located in the SRC\StubCode\ directory). This text contains the header declarations for the IDMAlgorithm and IDMPersist interfaces. They can go after the declarations for ISupportErrorInfo, which will already be in the file.

  11. Create two new C++ files, ALGORITHM.IDMAlgorithm.cpp and ALGORITHM.IDMPersist.cpp, using Add New Item on the Project menu. Next, put the following lines at the top of each:

    #include "stdafx.h"
    #include "ALGORITHM.h" 
    
  12. Copy the contents of IDMAlgorithm.Stubs.cpp and IDMPersist.Stubs.cpp into ALGORITHM.IDMAlgorithm.cpp and ALGORITHM.IDMPersist.cpp, respectively. In both files, replace all occurrences of the string YOUR_ALG_CLASS with ALGORITHM.

You should now be able to compile the project with no errors.

Step 3: Creating the FACTORY class

This step is very similar to the previous step. We will create the FACTORY class to implement both the IDMAlgorithmFactory and the IDMAlgorithmMetadata interfaces.

  1. Select Add Class on the Project menu.

  2. Expand the Visual C++ folder, select ATL Simple Object from the ATL folder, and click Open.

  3. Type in the name of your algorithm in the Short name edit box. We used the short name FACTORY, and then removed the "C" from the Class name so that the class name is also FACTORY.

  4. Change the interface from IFACTORY to IDMAlgorithmFactory.

  5. Click the Options tab. Choose Free for the threading model, choose Custom for the interface, and select the ISupportErrorInfo check box.

  6. Click Finish.

  7. In the resulting header file, which we assume is named FACTORY.h, include the following headers directly beneath the inclusion of resource.h:

    #include "StaticAlgorithmMetadata.h"
    #include "ALGORITHM.h"
    
  8. Specify that the FACTORY class also inherits from IDMAlgorithmMetadata:

    class ATL_NO_VTABLE FACTORY :
        public CComObjectRootEx<CComMultiThreadModel>,
        public CComCoClass<FACTORY, &CLSID_FACTORY>,
        public ISupportErrorInfo,
        public IDMAlgorithmFactory,
        public IDMAlgorithmMetadata
    
  9. Add the following between BEGIN_COM_MAP and END_COM_MAP:

    COM_INTERFACE_ENTRY(IDMAlgorithmMetadata)
    
  10. Copy the header declarations for IDMAlgorithmFactory from the file IDMAlgorithmFactory.Stubs.h, located in the SRC\StubCode\ directory. Paste these declarations into a public area of the FACTORY definition.

  11. Copy the contents of IDMAlgorithmFactory.Stubs.cpp into the file implementing your factory class, which we assume is named FACTORY.cpp. In this file, replace the single occurrence of the string YOUR_FACTORY_CLASS with FACTORY, and replace the two occurrences of the string YOUR_ALG_CLASS with ALGORITHM.

    The next five steps assume that you want to implement the IDMAlgorithmMetadata interface using static-member functions on your ALGORITHM class. If this is not what you want to do, you can paste the declaration and implementation of this interface from the files SRC\StubCode\IDMAlgorithmMetadata.FactoryStubs.* into your FACTORY class.

  12. Include the following macro in a public area of the FACTORY class declaration:

    IMPLEMENT_STATIC_ALGORITHM_METADATA(ALGORITHM)
    

    This macro implements the IDMAlgorithmMetadata interface by calling the corresponding static members of the ALGORITHM class (which have not yet been declared in ALGORITHM).

  13. In a public area of your ALGORITHM class in the header file ALGORITHM.h, paste the declarations of the static-member IDMAlgorithmMetadata functions from the file IDMAlgorithmMetadata.StaticStubs.h.

  14. Create a new C++ file ALGORITHM.IDMAlgorithmMetadata.cpp (by selecting Add New Item on the Project menu), and add the following to the top of the file:

    #include "stdafx.h"
    #include "ALGORITHM.h" 
    
  15. Paste the implementation code for the IDMAlgorithmMetadata functions from IDMAlgorithmMetadata.StaticStubs.cpp into the newly created file ALGORITHM.IDMAlgorithmMetadata.cpp. Replace all occurrences of the string YOUR_ALG_CLASS with ALGORITHM.

  16. Add the parameter-handling macros into your ALGORITHM class definition and implementation. First, add the following to the list of header files at the top of ALGORITHM.h:

    #include "ParamHandler.h"
    

    Then, add the following to a public area of your ALGORITHM class definition:

    DECLARE_STATIC_PARAMETER_HANDLING()
    

    Finally, add the following to the file ALGORITHM.cpp:

    BEGIN_PARAMETER_DECLARATION(ALGORITHM)
    END_PARAMETER_DECLARATION(ALGORITHM)
    

    You will define your algorithm-specific parameters by inserting instances of the DECLARE_PARAMETER macro between these two lines. These macros together will implement the nine IDMAlgorithmMetadata parameter-handling functions for you automatically.

You should again be able to compile the project with no errors.

Step 4: Creating the NAVIGATOR class

This step is similar to the previous two. We will create the NAVIGATOR class to implement IDMAlgorithmNavigation.

  1. Select Add Class on the Project menu.

  2. Expand the C++ folder, select ATL Simple Object from the ATL folder, and click Open.

  3. Type the name of your navigator class in the Short name edit box. We used the short name NAVIGATOR, and then removed the "C" from the Class name so that the class name is also NAVIGATOR.

  4. Change the interface from INAVIGATOR to IDMAlgorithmNavigation.

  5. Click the Options tab. Choose Free for the threading model, choose Custom for the interface, and select the ISupportErrorInfo check box. Click Finish.

  6. In the resulting header file, which we assume is named NAVIGATOR.h, include the following headers directly beneath the inclusion of resource.h:

    #include "DmhMemory.h"
    #include "dmalgo.h"
    #include "oledbdm.h"
    #include "ALGORITHM.h"
    
  7. Paste in the header declarations for IDMAlgorithmNavigation from the file IDMAlgorithmNavigation.Stubs.h, located in the SRC\StubCode\ directory.

  8. Copy the contents of IDMAlgorithmNavigation.Stubs.cpp into the file implementing your navigation class, which we assume is named NAVIGATOR.cpp. In this file, replace all occurrences of the string YOUR_NAVIGATION_CLASS with NAVIGATOR.

You should again be able to compile the project with no errors.

Step 5: Registering the algorithm with Analysis Services

In this step, we will enable your algorithm to be called by Analysis Services.

  1. First we will set the service name of the algorithm, which is the name that Analysis Services uses internally to refer to your algorithm. Go to the implementation of IDMAlgorithmMetadata::GetServiceName(), which is implemented by a static member of the ALGORITHM class in the file ALGORITHM.IDMAlgorithmMetadata.cpp. The stub code results in the service name being SERVICENAME. Change this to whatever name you want to assign to your algorithm; if you are going to step through the customization steps in Section 7, you should use the name PairWiseLinearRegression:

    static const LPWSTR szName   =   L"PairWiseLinearRegression";
    
  2. If you have not already started Analysis Services at least once, do so now to populate the initialization file. To start Analysis Services, click Start, click Control Panel, and double-click Administrative Tools (or, if you are using the Category View for Control Panel, click Performance and Maintenance and then single-click Administrative Tools). Then double-click Services, right-click MSSQLServerOLAPService, and select Start. If the Start option is not available, the service is already running.

  3. Locate the initialization file msmdsrv.ini. This is typically located in the directory \Program Files\Microsoft SQL Server\MSSQL.1\OLAP\bin\, on the drive where Analysis Services was installed. The file msmdsrv.ini is an XML document that you need to edit to include your algorithm in the list of algorithms available to the server.

  4. Open msmdsrv.ini with your favorite text editor.

  5. Look for the XML element <Algorithms> that is a child of the element <DataMining>, which in turn is a child of the root element <ConfigurationSettings>. Create a new child element inside <Algorithms> whose name is the service name of your algorithm that you defined in Step 1 above. We will assume that you are using PairWiseLinearRegression as the service name, and thus you will create the element <PairWiseLinearRegression> as a child of <Algorithms>.

  6. Next, add a child element of <PairWiseLinearRegression> called <ProgID>. This element contains the ProgID for the COM object that implements the IDMAlgorithmFactory interface; if you are using the same names as in this tutorial, this ProgID will be PlugIn.FACTORY.1.

    If you are not using the same names as in this tutorial, and if you have compiled the project after Step 3, there will be a file called FACTORY.rgs in the "Resource Files" section of your project. Look for the line that specifies the ProgID; it should be of the form PROJECTNAME.FACTORY.1.

  7. Finally, add a second child element of <PairWiseLinearRegression> called <Enabled>. Set its contents to 1 if you want the server to be able to use your algorithm, or to 0 if not.

    The relevant portion of the resulting initialization file should now be:

    <ConfigurationSettings>
       ...
       <DataMining>
          ...
          <Algorithms> 
             ...
             <PairWiseLinearRegression>
                <ProgID>PlugIn.FACTORY.1</ProgID>
                <Enabled>1</Enabled>
             </PairWiseLinearRegression>
          </Algorithms>
          ...
       </DataMining>
       ...
    </ConfigurationSettings>
    
  8. Save the file.

  9. For Analysis Services to incorporate the changes you've made in the initialization file, you must restart the service. To restart Analysis Services, click Start, click Control Panel, double-click Administrative Tools (or if you are using the Category View for Control Panel, click Performance and Maintenance, and then single-click Administrative Tools), and then double-click Services. Right-click MSSQLServerOLAPService and select Restart.

    Whenever you recompile the project after this point, you will need to first stop Analysis Services (using Stop instead of Restart above) in order to write a new copy of the DLL. After compiling, you then start the service again (using Start instead of Restart above).
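If you prefer the command line, the same stop/rebuild/start cycle can be scripted from an elevated command prompt (this assumes the default service name used above; adjust it if your instance is named differently):

```shell
rem Stop Analysis Services so the build can overwrite the plug-in DLL.
net stop MSSQLServerOLAPService

rem (Rebuild the project in Visual Studio here.)

rem Start the service again so it picks up the new DLL.
net start MSSQLServerOLAPService
```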

You now have the shell algorithm implemented and integrated into Analysis Services!

Step 6: Testing the shell algorithm

Although the shell algorithm does not do anything useful, you can verify that the interfaces are being called. Open the Business Intelligence Development Studio, which can be located by clicking Start, clicking All Programs, and then clicking Microsoft SQL Server.

Choose Blank Solution under New on the File menu. Click Business Intelligence Projects under Project Types and select Analysis Services Project under Templates. Change the name of the project to "Sample", and then click OK.

First, we will connect to a data source:

  1. In the Solution Explorer window, right-click Data Sources and select New Data Source.
  2. Click Next on the initial page, and then click New Connection on the Select how to define the connection page.
  3. Click the Provider tab of the resulting Data Link Properties dialog, select the Microsoft Jet 4.0 OLE DB Provider, and click Next.
  4. Click the ... button next to the Select or enter a database name edit box. Select the Sample.mdb file from the SRC\Demo\ directory and click OK.
  5. Now click Next on the Select how to define the connection page. Click Finish on the resulting page.

Next, we'll define a view of the data:

  1. In the Solution Explorer window, right-click Data Source Views and select New Data Source View.
  2. Click Next on the initial page and click Next again on the Select Data Source page.
  3. Click the top > button on the Select Tables and Views page to select the XYZ table, and then click Next. Click Finish on the next page.

Now we are ready to test the shell algorithm:

  1. Right-click Mining Models in the Solution Explorer window, and choose New Mining Model.

  2. Click Next on the initial wizard page, and then Next again on the Definition Method page, leaving From existing... as the method.

  3. On the Select the Data Mining Technique page, there is a drop-down list of available algorithms to use. If you click on the drop-down arrow, your algorithm should appear as the last member in the list, as Undefined Localized String:

    Figure 11. List of available algorithms, showing the new shell algorithm

  4. Cancel out of this dialog.

If Undefined Localized String did not appear, it might mean that Analysis Services was running when you last compiled your project. Try the following:

  1. Save the Business Intelligence solution and exit out of the Business Intelligence Development Studio.
  2. Stop Analysis Services and recompile your project.
  3. Start Analysis Services, then start the Business Intelligence Development Studio and open the Business Intelligence solution again.

To get a list of available algorithms, Analysis Services has used the initialization file to create an instance of your FACTORY class, and has cached away corresponding instances of the IDMAlgorithmFactory and IDMAlgorithmMetadata interfaces. To obtain the name of your algorithm for the drop-down list, Analysis Services calls the function IDMAlgorithmMetadata::GetDisplayName(). This function populates the display name, which is a localizable string; because we have not yet defined this string, it defaults to Undefined Localized String.

You can see the details of how this default string is populated by looking at the (static) function ALGORITHM::GetDisplayName() and the localization routine ::LoadStringFromID() from the file DmhLocalization.cpp.

The Customizing the Algorithm: Pair-wise Linear Regression Section

In this section, we customize the shell algorithm from the previous section to implement the pair-wise linear regression model described in Section 3. If you did not complete Section 6, you can start with the code provided in the directory SRC\CompletedShell\.

First, copy the following files from SRC\Utilities\ into your project directory PRJ\, and add them into the project:

  • DmhVector.h and DmhAllocator.h: As described in Section 4, these files allow you to use STL vectors with the memory-management interfaces supplied by Analysis Services.
  • DataValues.cpp and DataValues.h: These two files are used to translate between the value types used by Analysis Services and the simple type double, which is particularly convenient for our example algorithm.
  • caseprocessor.cpp and caseprocessor.h: As described in Section 4, these two files are used to implement the case-processing interface for you.

Next, copy the following files from SRC\CustomCode\ into your project directory PRJ\, and add them into the project:

  • Hierarchies.h: This file contains the structure that implements the two model hierarchies needed by the navigation interface.
  • lrsstatreader.cpp and lrsstatreader.h: These files contain the class LRSSTATREADER, which derives from the case processor, and collects from each case the statistics needed to learn a pair-wise linear regression model.
  • lrpmodel.cpp and lrpmodel.h: These files implement the LRPMODEL class, which is the class we use to represent the statistical model learned by our algorithm.

In the next five steps, we will customize the functions of the five interfaces to implement the pair-wise linear regression model. For most of the functions, the changes are simple and we write out explicitly the code that needs to be inserted into the completed-shell implementation. For some functions, however, there is enough code to be inserted that we instead provide the customized function implementations in Customize.cpp, located in the SRC\CustomCode\ directory. These functions should be copied into your own code. You should review the functions as you copy them, but the large amount of code is usually uninteresting bookkeeping rather than insightful procedure.
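As background for the customization steps, the closed-form arithmetic behind each pair-wise regression can be sketched in standalone C++. The struct and function names below are ours for illustration, not part of the SDK or of LRPMODEL: given the sums collected over the cases, the coefficients of Y = a + b*X and the correlation coefficient (which DISPLAY_CORRELATION shows in place of the linear coefficient) follow directly:

```cpp
#include <cmath>

// Sufficient statistics for one pair of attributes (X, Y),
// accumulated in a single pass over the cases.
struct PairStats
{
    double        sumX;    // sum of X over all cases
    double        sumY;    // sum of Y over all cases
    double        sumXY;   // sum of the product X*Y
    double        sumXX;   // sum of X*X (the pair (X, X))
    double        sumYY;   // sum of Y*Y (the pair (Y, Y))
    unsigned long n;       // number of cases
};

// Fit Y = a + b*X and compute the correlation coefficient r.
void FitPair(const PairStats& s, double& a, double& b, double& r)
{
    double covXY = s.n * s.sumXY - s.sumX * s.sumY;  // n^2 * sample covariance
    double varX  = s.n * s.sumXX - s.sumX * s.sumX;  // n^2 * sample variance of X
    double varY  = s.n * s.sumYY - s.sumY * s.sumY;  // n^2 * sample variance of Y

    b = covXY / varX;                    // slope
    a = (s.sumY - b * s.sumX) / s.n;     // intercept
    r = covXY / std::sqrt(varX * varY);  // correlation coefficient
}
```

For example, three cases with X = 0, 1, 2 and Y = 2X + 1 give sumX = 3, sumY = 9, sumXY = 13, sumXX = 5, and sumYY = 35, yielding b = 2, a = 1, and r = 1.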

Step 1: Customizing IDMAlgorithmMetadata

The first task is to implement all of the IDMAlgorithmMetadata functions defined in ALGORITHM.IDMAlgorithmMetadata.cpp to correspond to the model. There are a lot of these functions (35 in total), but each individual function is reasonably easy to customize. We consider them each in turn.

As you make the described changes, you should read the comments in the corresponding functions to make sure you understand the details.

  1. GetServiceType: Our algorithm is doing a pair-wise density estimation, so we set:

    
       *out_pServiceTypeID = DM_SERVICETYPE_DENSITY_ESTIMATE;
    
    
  2. GetServiceName: We have already changed this in the previous section to PairWiseLinearRegression. If you are using the code provided in the SRC\CompletedShell\ directory, and you would like to use another name, you need to change it here.

  3. GetDisplayName: You need to have a string ID for the display name, and a corresponding way to map that ID to a string. For the purposes of this tutorial, we'll implement our localization by simply doing a switch statement in the function LoadStringFromID() from the file DmhLocalization.cpp. Clearly, you should do something more appropriate for a real application. There are four localizable strings in our demo: the service display name, the service description, the description of the MINIMUM_DEPENDENCY_SCORE parameter, and the description of the DISPLAY_CORRELATION parameter.

    Open the PRJ\resource.h file and add:

    #define IDS_SERVICE_DISPLAY_NAME             105
    #define IDS_SERVICE_DESCRIPTION              106
    #define IDS_MINIMUM_DEPENDENCY_SCORE_DESCR   107
    #define IDS_DISPLAY_CORRELATION_DESCR        108
    

    Increment the value following _APS_NEXT_SYMED_VALUE to 109. (If 105 is already being used as a resource ID because you have added other resources, you will need to change the numbers as appropriate.)

    Include the file Resource.h at the top of the file ALGORITHM.IDMAlgorithmMetadata.cpp:

    #include "Resource.h"
    

    Replace the entire function body of LoadStringFromID() in DmhLocalization.cpp with the following:

       switch (iIDString)
       {
          case IDS_SERVICE_DISPLAY_NAME:
    
             swprintf(wchar, L"Pairwise Linear Regression");
             break;
    
          case IDS_SERVICE_DESCRIPTION:
    
             swprintf(wchar, L"Build a linear regression model for every pair of continuous variables in the domain");
             break;
    
          case IDS_MINIMUM_DEPENDENCY_SCORE_DESCR:
    
             swprintf(wchar, L"Minimum score needed to include a pair in the browser and in prediction");
             break;
    
          case IDS_DISPLAY_CORRELATION_DESCR:
    
             swprintf(wchar, L"Display correlation coefficients instead of linear coefficients when browsing");
             break;
    
    
          default:
    
             swprintf(wchar, L"Unexpected string ID");
       }   
    
       return S_OK;
    

    Note that we are not checking the ccharMax argument to compare with the length of the appropriate output string; in a real application, you will need to attend to these details.

    Include the file Resource.h at the top of the file DmhLocalization.cpp:

    #include "Resource.h"
    

    Now, in the implementation of IDMAlgorithmMetadata::GetDisplayName(), change the value of iIDStrAlgorithm from -1 to IDS_SERVICE_DISPLAY_NAME.

  4. GetServiceGuid: Use any GUID that will uniquely identify the algorithm. We will use the CLSID of the algorithm class, and thus:

    *out_pguidServiceGUID = CLSID_ALGORITHM;
    
  5. GetDescription: This is the second localizable string. Similar to (3), change the value of iIDStrDescription from -1 to IDS_SERVICE_DESCRIPTION.

  6. GetPredictionLimit: We won't have a limit, so leave the value at 0.

  7. GetSupDistributionFlags: Leave as-is.

  8. GetSupInputContentTypes: We only support continuous input variables, and we always need to support the key column, so use the following as the only ones (that is, simply un-comment the continuous one):

    (ULONG) DM_MININGCOLUMN_CONTENT_KEY,
    (ULONG) DM_MININGCOLUMN_CONTENT_CONTINUOUS, 
    
  9. GetSupPredictContentTypes: We only support continuous output variables, so comment out discrete and un-comment continuous:

       //   (ULONG) DM_MININGCOLUMN_CONTENT_DISCRETE,
          (ULONG) DM_MININGCOLUMN_CONTENT_CONTINUOUS,
    
  10. GetSupModelingFlags: In order to work with the decision-tree browser, we need to declare that we support the "regressor" modeling flag. The appropriate regressor modeling flag is defined in dmalgo.h, in the DM_MODELING_FLAG_ENUM enumeration. Replace the entire body of the function with the following code:

       static DM_MODELING_FLAG rgiModelingFlag[] = 
       {
          DM_MODELING_FLAG_REGRESSOR,
       };
    
       static const ULONG ciModelingFlag = 
          sizeof(rgiModelingFlag)/sizeof(DM_MODELING_FLAG);
    
       *out_prgFlags = rgiModelingFlag;
       *out_cFlags   = ciModelingFlag;
    
       return S_OK;
    
  11. GetModelingFlagName: Leave as-is, since the pair-wise linear regression algorithm does not have any custom modeling flags.

  12. GetTrainingComplexity: We're only going to need a single pass over the data, so use:

           *out_pTrainingComplexity = DM_TRAINING_COMPLEXITY_LOW;
    
    
  13. GetPredictionComplexity: Our prediction time will scale linearly with the number of input variables, so use:

           *out_pPredictionComplexity = DM_PREDICTION_COMPLEXITY_LOW;
    
    
  14. GetExpectedQuality: Leave as-is.

  15. GetScaling: Leave as-is.

  16. GetAllowIncrementalInsert: Leave as-is.

  17. GetAllowDuplicateKey: Leave as-is.

  18. GetControl: Leave as-is.

  19. GetViewerType: We'll be using the decision-tree viewer to view our model: For each output variable, we'll have a single tree with no splits, and we'll list in the single leaf node all of the regression coefficients as if it were a multiple-regression model. Thus, we need only un-comment the existing sample code (and delete the final return S_OK) that specifies the decision-tree viewer.

  20. GetSupportsDMDimensions: We'll support data mining dimensions, which means that we can slice an OLAP cube based on the output of our model. Change the function to include:

       *out_pfSupportsDMDimensions = TRUE;
    
  21. GetSupportsDrillthrough: We'll support Drillthrough, which means that we can identify data cases that are relevant to different portions of the model while browsing. Change the function to include:

       *out_pfSupportsDrillthrough = TRUE;
    
    

    Keep the must-include-children flag equal to TRUE.

  22. GetSupportedFunctions: Leave as-is.

The next set of functions (numbers 23-32) deal with algorithm-specific parameters that you define. As mentioned above when we created the localizable string IDs, we have two parameters: MINIMUM_DEPENDENCY_SCORE determines the score needed for a pair to be included in the viewer and in the prediction algorithm; DISPLAY_CORRELATION determines whether or not to show correlation coefficients instead of linear coefficients when browsing our models. We will define MINIMUM_DEPENDENCY_SCORE to be a float with default value 3 (this corresponds to the requirement that the data implies a dependency is three times more likely than no dependency). We will define DISPLAY_CORRELATION to be a Boolean with default value false.

Each algorithm-specific parameter has a number of properties:

  • The name of the parameter, which is MINIMUM_DEPENDENCY_SCORE or DISPLAY_CORRELATION in our case. Names of parameters are not case sensitive (for example, disPlaY_CorRElaTIon is allowed for our algorithm).

  • A description of the parameter, which is a localizable string. We'll use the resource identifiers IDS_MINIMUM_DEPENDENCY_SCORE_DESCR and IDS_DISPLAY_CORRELATION_DESCR, as described above.

  • A type, which is specified using a string. The string for a type matches the name of the DBTYPEENUM enumeration in file oledb.h. Examples include DBTYPE_I4 and DBTYPE_R4. We'll use:

    • DBTYPE_R4, corresponding to a float, for MINIMUM_DEPENDENCY_SCORE.
    • DBTYPE_BOOL, corresponding to a Boolean, for DISPLAY_CORRELATION.

    To simplify the code, we will specify parameters using this enumeration and provide translation routines from strings to this enumeration.

  • A flag indicating whether or not the value is required. If it is required, the user must explicitly specify the value. Neither of our two parameters will be required.

  • A flag indicating whether or not the value is exposed (visible to the user). We will expose both parameters.

  • A set of flags, represented in a ULONG. We will not have any flags associated with either of our parameters.

  • A default value, represented as a string. As stated above, we'll use default values of 3 and false for MINIMUM_DEPENDENCY_SCORE and DISPLAY_CORRELATION, respectively.

  • An enumeration, which is a user-friendly (non-localizable) string describing the range of the parameter. For MINIMUM_DEPENDENCY_SCORE we will allow any number, so we'll use (-inf, inf) to denote the range from negative infinity to infinity. For DISPLAY_CORRELATION, the enumeration will be TRUE or FALSE.

If you are using static parameter handling, all of the parameter functions are implemented for you automatically; you need only specify the eight properties of each parameter using the macro DECLARE_PARAMETER(), placed between BEGIN_PARAMETER_DECLARATION(ALGORITHM) and END_PARAMETER_DECLARATION(ALGORITHM) in the file ALGORITHM.cpp. For our two parameters, the relevant code is:

BEGIN_PARAMETER_DECLARATION(ALGORITHM)
   DECLARE_PARAMETER(L"MINIMUM_DEPENDENCY_SCORE",   // Name
            IDS_MINIMUM_DEPENDENCY_SCORE_DESCR,   // Res ID 
            DBTYPE_R4,      // Type, as a DBTYPEENUM
            false,         // Required flag
            true,         // Exposed flag
            0,            // General flags
            L"3",         // Default value, as a string
              L"(-inf,inf)")   // Enumeration, as a string
   DECLARE_PARAMETER(L"DISPLAY_CORRELATION",   // Name
            IDS_DISPLAY_CORRELATION_DESCR,   // Res ID 
            DBTYPE_BOOL,   // Type, as a DBTYPEENUM
            false,         // Required flag
            true,         // Exposed flag
            0,            // General flags
            L"FALSE",         // Default value, as a string
              L"TRUE or FALSE")   // Enumeration, as a string
END_PARAMETER_DECLARATION(ALGORITHM) 

If you are not using static parameter handling, and if you are not using the above macros, you must implement the following functions, which should be reasonably simple (although somewhat tedious if you have a lot of parameters).

  1. GetNumParameters
  2. GetParameterName
  3. GetParameterType
  4. GetParameterIsRequired
  5. GetParameterIsExposed
  6. GetParameterFlags
  7. GetParameterDescription
  8. GetParameterDefaultValue
  9. GetParameterValueEnumeration
  10. ParseParameterValue

If you have parameters with non-numeric types, you will need to modify ParseParameterValue to perform the appropriate parsing, even if you are using the static parameter handling.
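As a concrete illustration of the kind of parsing involved, the following standalone sketch maps a DBTYPE_BOOL parameter string, such as the value of DISPLAY_CORRELATION, to a typed value. The helper is hypothetical (ours, not an SDK function), and accepting the value case-insensitively is our own choice here:

```cpp
#include <cwctype>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse "TRUE"/"FALSE" (any casing) or "1"/"0"
// into a bool, rejecting anything else.
bool ParseBoolParameterValue(std::wstring value)
{
    for (wchar_t& ch : value)
        ch = static_cast<wchar_t>(std::towupper(ch));

    if (value == L"TRUE" || value == L"1")
        return true;

    if (value == L"FALSE" || value == L"0")
        return false;

    throw std::invalid_argument("not a valid boolean parameter value");
}
```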

The remaining IDMAlgorithmMetadata functions are:

  1. GetMarginalRequirements: Due to a problem that will be fixed soon, you are required to have Analysis Services build marginal statistics for you. Thus, change the requirements using:

    *out_pReq = DMMR_ALL_STATS;
    
  2. GetCaseIDModelled: Leave as-is.

  3. ValidateAttributeSet: We simply need to verify that every attribute is continuous, which can be done by adding the following code:

       ULONG cAttribute;
    
       HRESULT hr = in_pAttributeSet->GetAttributeCount(&cAttribute);
    
       RETURN_ON_FAIL(hr);
    
       for (UINT iAttribute = 0; iAttribute < cAttribute; iAttribute++)
       {
          DM_ATTRIBUTE_FLAGS dm_attribute_flags;
    
          hr = in_pAttributeSet->GetAttributeFlags(iAttribute, &dm_attribute_flags);
    
          RETURN_ON_FAIL(hr);
    
          if (!(dm_attribute_flags & DMAF_CONTINUOUS))
          {
             return E_FAIL;
          }
       }
    
       return S_OK;
    

You should be able to compile the project with no errors.

Step 2: Customizing IDMAlgorithmFactory

You need not make any changes to the IDMAlgorithmFactory interface, but it is useful to review the code to see how the single function CreateAlgorithm allows your algorithm to use the IDMModelServices interface.

Recall from Section 5.4 that if you want to allocate any memory with your algorithm that lasts past the lifetime of a function call, you need to cache a special model-services interface, IDMModelServices, away with your algorithm. IDMModelServices is available only when the model is created with IDMAlgorithmFactory::CreateAlgorithm(). In this tutorial, we solve this problem by deriving ALGORITHM from the memory-allocator class DMHALLOC; then, in CreateAlgorithm, we initialize the base class to refer to the memory-allocator interface within the passed-in IDMModelServices interface. As a result, you can allocate memory for your algorithm as follows:

void ALGORITHM::ExampleMemberFunction()
{
      // Create a new structure and store into member variable

      _pmyclass = new(*this) MYCLASS();
} 
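The new(*this) expression compiles because DmhMemory.h supplies a placement form of operator new that takes a DMHALLOC reference. As a rough standalone analogy of that pattern (all names below are ours; the real mechanics live in the supplied utility files), a class can route its allocations through an allocator object like this:

```cpp
#include <cstddef>

// Toy bump allocator standing in for the SDK's memory-allocator interface.
struct Arena
{
    unsigned char buf[1024];
    std::size_t   used = 0;

    void* Alloc(std::size_t cb)
    {
        void* p = buf + used;
        used += cb;
        return p;
    }
};

struct MYCLASS
{
    int value = 42;

    // Placement new routing allocation through the allocator object,
    // analogous to new(*this) with a DMHALLOC-derived class.
    static void* operator new(std::size_t cb, Arena& a) { return a.Alloc(cb); }
    static void  operator delete(void*, Arena&) {}  // matching placement delete
};
```

With this in place, `Arena arena; MYCLASS* p = new(arena) MYCLASS();` draws the object's storage from the arena rather than from the global heap.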

Step 3: Customizing IDMAlgorithm

Next, we customize all the functions from the IDMAlgorithm interface from the file ALGORITHM.IDMAlgorithm.cpp.

  1. Initialize: We need to cache away the IDMAttributeSet interface passed in. Add the following public member variable to the ALGORITHM class declared in ALGORITHM.h:

    public:
    
          CComPtr<IDMAttributeSet>   _spidmattributeset;
    

    Then replace the existing body of Initialize() with:

    
       // Cache away the attribute set
       _spidmattributeset = in_pAttributeSet;
    
       return S_OK;
    
    
  2. InsertCases: This is the function that trains the model from data.

    First, we want to parse the parameters that have been passed in. As mentioned in Section 3, we will display either the linear coefficients or the correlation coefficients, depending on the value of an algorithm parameter.

    Some of the algorithm parameters are used only in the training process. For such parameters (like MINIMUM_DEPENDENCY_SCORE), storing the value in a local variable for the scope of InsertCases would usually be enough. However, the actual value used in training is part of a discovery schema supported by the server, and the server can therefore request it from the plug-in algorithm at any time through the GetTrainingParameterActualValue function of the IDMAlgorithm interface. It is useful, then, to preserve the values of all the training parameters in member variables of the ALGORITHM class.

    In the file ALGORITHM.h, add the following member variables to the ALGORITHM class:

    bool      _bDisplayCorrelation;
    double   _dblMinDepScore;
    

    Then initialize these variables to false and 0.0, respectively, in the constructor:

    ALGORITHM()
    {
       _bDisplayCorrelation = false;
       _dblMinDepScore = 0.0;
    }
    

    Assuming that we've used static parameter handling, it is easy to parse the two parameters. Add the following to the top of ALGORITHM::InsertCases:

       HRESULT hr = _dmhparamhandler.GetParameterValue(in_pExeContext, 
                         _dblMinDepScore,
                         L"MINIMUM_DEPENDENCY_SCORE",
                         in_ulNumParameters,
                         in_rgParameterNames,
                         in_rgParameterValues);
    
       RETURN_ON_FAIL(hr);
    
       hr = _dmhparamhandler.GetParameterValue(in_pExeContext, 
                         _bDisplayCorrelation,
                         L"DISPLAY_CORRELATION",
                         in_ulNumParameters,
                         in_rgParameterNames,
                         in_rgParameterValues);
    
       RETURN_ON_FAIL(hr);
    

    Next, we want to scan the data to extract the appropriate statistics for the linear regression model. It turns out that these statistics are:

    • For each pair of attributes X and Y, we need the sum, over all cases, of the product of X and Y.
    • For each attribute, we need the sum of that attribute over all cases.
    • We need a count of the total number of cases.

    There is a generic case-processing class that simplifies data access for you; it is called CASEPROCESSOR, and it can be found in caseprocessor.h. This class takes a CASEREADER upon initialization; we need only derive a class from CASEREADER that implements the function:

    virtual HRESULT ProcessCaseDense(ULONG ulID, VDBL& vdblValue);
    
    

    In the file lrsstatreader.h, we have an implementation of the class LRSSTATREADER that collects the necessary sums-of-products and sums for the model. Add the following to the top of ALGORITHM.IDMAlgorithm.cpp:

    #include "lrsstatreader.h"
    

    Then add the following code to InsertCases() after the parameter-extraction code you just pasted in this function:

       DMHALLOC   dmhalloc; // Memory-allocation wrapper.
    
       dmhalloc.SetAllocator(in_pExeContext);   
    
       // Create a case processor.
    
       CComObjectDM<CASEPROCESSOR>*   pcaseprocessor = NULL;
    
       hr = CComObjectDM<CASEPROCESSOR>::CreateInstance(dmhalloc, &pcaseprocessor);
    
       RETURN_ON_FAIL(hr);
    
       LRSSTATREADER lrsstatreader(dmhalloc);
    
       // Initialize the reader with the attributes.
    
       lrsstatreader.Initialize(_spidmattributeset);
    
       // Initialize the case processor with our reader.
    
       hr = pcaseprocessor->Initialize(_spidmattributeset, in_pCaseSet, &lrsstatreader);
    
       RETURN_ON_FAIL(hr);
    
       // QI the CASEPROCESSOR for IDMCaseProcessor.
    
       CComPtr<IDMCaseProcessor>   spidmcaseprocessor;
    
       hr = pcaseprocessor->QueryInterface(&spidmcaseprocessor);
    
       RETURN_ON_FAIL(hr);
    
       // Load in all of the statistics.
    
       hr = in_pCaseSet->StartCases(spidmcaseprocessor, false /*need case id*/);
    
       RETURN_ON_FAIL(hr);
    

    Now LRSSTATREADER contains all of the statistics we need to build the pair-wise regressions. Include the file lrpmodel.h in the header file ALGORITHM.h declaring your ALGORITHM class:

    #include "lrpmodel.h"
    

    This defines the class LRPMODEL, which we will use to represent the actual statistical model.

    In the ALGORITHM class, add the member variable:

    LRPMODEL   _lrpmodel;
    
    

    Finally, add the following code to the end of InsertCases():

       // We must use the model-level allocator for _lrpmodel because
       // this member's data will live past the lifetime of the current
       // call.
    
       _lrpmodel.SetAllocator(*this);
    
       hr = _lrpmodel.PopulateModel(lrsstatreader, _dblMinDepScore);
    
       return hr;
    
    
  3. Predict: For this function, you should simply paste in the corresponding code from the file SRC\CustomCode\Customize.cpp. The code takes the input case, turns it into a dense representation (where the ith element of a vector corresponds to the value for attribute i), and then calls the lrpmodel.ExtractPosterior() function. There is some bookkeeping that needs to be done, but it is all straightforward.

    The supplied function uses the square-root function, so you need to include math.h at the top of ALGORITHM.IDMAlgorithm.cpp:

    #include <math.h>
    
  4. GetNodeIDsForCase: For this function, you should paste in the corresponding code from the file Customize.cpp. First, you should include the header Hierarchies.h at the top of ALGORITHM.IDMAlgorithm.cpp:

    #include "Hierarchies.h"
    

    There is a technical glitch with allocating vectors of strings; the workaround is to copy AllocationFix.h from the SRC\CustomCode directory into the project directory, add it to your project, and then include this file in ALGORITHM.IDMAlgorithm.cpp above the definition of GetNodeIDsForCase:

    #include "AllocationFix.h"
    

    This will define the function AllocStringVectorForServer, which is called by GetNodeIDsForCase.

    GetNodeIDsForCase maps cases to node IDs, which are simply labels for nodes in a tree layout of your model. There are two types of tree layouts that are used for your model, and they can be different. First there is a browsing layout that will be used by a browser that allows users to interactively navigate through the content of the model. If you allow Drillthrough (which we do in the tutorial), GetNodeIDsForCase is used to map cases in the data to relevant nodes in the browsing layout. The second type of layout is the DM-dimension layout. This layout is a hierarchy that can be used to slice an OLAP cube as if your model were itself an attribute.

    The implementation of this function is closely related to the implementation of the navigator that you return by GetNavigator. The navigator will be used in conjunction with GetNodeIDsForCase to determine where cases map in your model.

    The two layout schemes are defined as follows. The browsing layout, which must follow the same format as decision trees in order to work with the internal tree browser, consists of a single tree. The root node (which is of type DM_NODE_TYPE_MODEL when accessed through the navigator) has a child node for each output attribute. Each such child node is the root of a decision tree, which in our case contains no splits; each child node is also a leaf node in a trivial decision tree, and contains the distribution information used by the browser.

    For the DM-dimension layout, we need a separate tree for each output attribute. To make things interesting, we define a non-trivial tree for this hierarchy: each output attribute has a tree consisting of a root with two (leaf) children. The mapping from cases to the leaves of the tree works as follows: if the posterior mean of the output attribute is greater than zero, the case maps to the first child of the corresponding root; otherwise the case maps to the second child of the root.

    All of the logic of these two hierarchies is implemented by the structure LRPHIERARCHY found in the file Hierarchies.h. The GetNodeIDsForCase function uses this structure to extract the appropriate node ID. If the caller is requesting the node ID for the DM-dimension layout, the function must extract the posterior distribution for each output attribute using the values in the passed-in case.

  5. GetNavigator: This function creates an instance of the NAVIGATOR class and then passes back the corresponding IDMAlgorithmNavigation interface. The appropriate code is almost the same as the commented-out code in the stub implementation. Because our NAVIGATOR class derives from DMHALLOC, however, we create new instances using the extended ATL templates from DmhMemory.h (that is, we use the CComObjectDM template instead of the CComObject template, and pass in a DMHALLOC reference to the CreateInstance function). Also, we must call the NAVIGATOR::Initialize function that we define in Step 6 below. First, we need to include the NAVIGATOR.h header file at the top of ALGORITHM.IDMAlgorithm.cpp:

    #include "NAVIGATOR.h"
    

    Following is the code for GetNavigator:

    
    // Create an instance of your navigation class and return the
    // desired interface.

    CComObjectDM<NAVIGATOR>* pNavClass;

    HRESULT hr = CComObjectDM<NAVIGATOR>::CreateInstance(*this, &pNavClass);

    RETURN_ON_FAIL(hr);

    hr = pNavClass->Initialize((in_fDimension == TRUE) ? true : false, this);

    RETURN_ON_FAIL(hr);

    hr = pNavClass->QueryInterface(__uuidof(IDMAlgorithmNavigation),
                                   (void**) out_ppDAGNav);

    RETURN_ON_FAIL(hr);

    return hr;
    
    
  6. GetSampleCaseSet: Leave as-is.

  7. GetTrainingParameterActualValue: For this function, you should paste in the corresponding code from the file Customize.cpp. It returns the actual value used in training for each of the algorithm parameters. The actual value may differ from what the user requested in situations like these:

    • The value specified by the user indicates an "auto-detect" mechanism that computes an optimized value from the input data.
    • The user does not specify anything and a default value is used.

    The function takes an index (ULONG) as its input argument and populates a variant handle. The index is the ordinal of the algorithm parameter as specified by the IDMAlgorithmMetadata implementation. In the discussion of InsertCases, we mentioned that preserving the parameter values as member variables would be useful in this function; the function simply fills the output variant handle with the value of one of those member variables:

        VARIANT varTmp;
        ::VariantInit(&varTmp);
        switch (in_iParameter)
        {
            case 0:
                V_VT(&varTmp) = VT_R8;
                V_R8(&varTmp) = (DOUBLE) _dblMinDepScore;
                break;
            case 1:
                V_VT(&varTmp)   = VT_BOOL;
                // V_BOOL holds a VARIANT_BOOL, so map the C++ bool explicitly.
                V_BOOL(&varTmp) = _bDisplayCorrelation ? VARIANT_TRUE : VARIANT_FALSE;
                break;
            default:
                // Parameter index out of range
                return E_INVALIDARG;
        }

        // Smart pointer for a variant handler
        CComPtr<IDMVariantPtrHandler> spidmvarianthandler;

        HRESULT hr = in_pContext->GetVariantHandler(&spidmvarianthandler);

        RETURN_ON_FAIL(hr);

        return spidmvarianthandler->CopyVariantToHandle(io_pParameterValue, &varTmp);
    
  8. HasFeatureSelection: Leave as-is, since pair-wise linear regression does not use feature selection.

  9. GetFeatureSelectedAttributes: Leave as-is, since pair-wise linear regression does not use feature selection.

  10. GetAttributeFeatureSelectionFlags: Leave as-is, since pair-wise linear regression does not use feature selection.

If you compile the project at this point, you should get a single error in ALGORITHM::GetNavigator, because we have not yet declared the function NAVIGATOR::Initialize.

Step 4: Customizing IDMPersist

In this step, we implement the IDMPersist interface on the ALGORITHM class. This interface consists of two functions: Load and Save. These functions load and save the statistical model created by your algorithm using the IDMPersistenceReader and IDMPersistenceWriter interfaces, respectively. Put all of the code into the file ALGORITHM.IDMPersist.cpp. It is straightforward to implement this interface, and you should simply copy into ALGORITHM.IDMPersist.cpp the implementation of the two relevant functions from Customize.cpp, along with the definition of DM_PERSIST_ENUM and the implementation of the four helper functions: LoadVvlrparam, SaveVvlrparam, LoadVdbl, and SaveVdbl.

Note that the Save and Load functions also persist the member variables that contain the actual training parameter values.
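The pattern behind helpers such as SaveVdbl and LoadVdbl is the usual one for persisting a variable-length collection: write the element count, then the elements, and read them back in the same order. The sketch below illustrates that pattern with standard streams as a stand-in for the IDMPersistenceWriter and IDMPersistenceReader interfaces (the real interfaces differ; this is only the shape of the idea):

```cpp
#include <cassert>
#include <sstream>
#include <vector>

// Stand-in for an IDMPersistenceWriter-style save: count first, then elements.
void SaveVdblSketch(std::ostream& out, const std::vector<double>& vdbl)
{
    out << vdbl.size() << ' ';
    for (double dbl : vdbl)
        out << dbl << ' ';
}

// Stand-in for an IDMPersistenceReader-style load: read the count, then
// read exactly that many elements back in order.
void LoadVdblSketch(std::istream& in, std::vector<double>& vdbl)
{
    size_t cdbl = 0;
    in >> cdbl;
    vdbl.resize(cdbl);
    for (size_t i = 0; i < cdbl; ++i)
        in >> vdbl[i];
}
```

Because Save and Load mirror each other exactly, any change to the model state (such as a new member variable) must be made in both functions, in the same position in the sequence.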

Step 5: Customizing IDMAlgorithmNavigation

The final interface to customize for the pair-wise linear regression model is the navigation interface implemented by the NAVIGATOR class. This interface is easy to implement with the help of the LRPHIERARCHY structure defined in Hierarchies.h.

First, in NAVIGATOR.h we will add DMHALLOC as a base class for NAVIGATOR, so that it can more easily contain STL vector member variables:

class ATL_NO_VTABLE NAVIGATOR : 
      public DMHALLOC,
      public CComObjectRootEx<CComMultiThreadModel>, ...

Next, we will add some state to the NAVIGATOR class:

  1. A pointer to the ALGORITHM class that we are navigating.
  2. A Boolean that indicates whether it is a DM-dimension navigator or a browsing navigator.
  3. An STL vector that contains the indices of all the output attributes.
  4. An LRPHIERARCHY structure to do most of the work for us.
  5. A current-node index.

Include the header files DmhVector.h and Hierarchies.h at the top of NAVIGATOR.h:

#include "DmhVector.h"
#include "Hierarchies.h"

Add the following member variables to the definition of NAVIGATOR in NAVIGATOR.h:

   protected:

      ALGORITHM*     _palgorithm;
      bool           _bDMDimension;
      VINT           _viAttributeOutput;  // Output attributes
      LRPHIERARCHY   _lrphierarchy;
      ULONG          _iIDNode;            // Current state of the navigator

Then add the declaration of the initialize function that we called from ALGORITHM::GetNavigator:

   public:

     HRESULT      Initialize(bool bDMDimension, 
                                   ALGORITHM* palgorithm);

Paste in the implementation for NAVIGATOR::Initialize from Customize.cpp into NAVIGATOR.cpp. This function simply initializes the new member variables of the NAVIGATOR class based on the input arguments.

The specialized STL vectors we are using (see Section 5.5) require a DMHALLOC structure upon construction, so we need to modify the constructor of NAVIGATOR (located in the header file) to take care of this. Also, we'll initialize the new member variables in the constructor in NAVIGATOR.h. Replace the existing constructor with the following code (note that the _viAttributeOutput member is initialized in the initializer list):

NAVIGATOR() : _viAttributeOutput(*this)
{
   _palgorithm   = NULL;
   _bDMDimension = false;
   _iIDNode      = (ULONG) -1;
}

Include the header file DataValues.h at the top of the file NAVIGATOR.cpp:

   #include "DataValues.h"   

We are now ready to customize the IDMAlgorithmNavigation functions defined in NAVIGATOR.cpp. We consider each function in turn. Most of them are trivial given the functionality of the LRPHIERARCHY structure.
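To make the bookkeeping in the steps below concrete, here is a minimal, self-contained sketch of the kind of node-ID scheme a structure like LRPHIERARCHY might use for the DM-dimension layout. The encoding and member names are our own assumptions, not the actual Hierarchies.h implementation:

```cpp
#include <cassert>

// Sketch of a DM-dimension hierarchy with _cAttribute trees: the root of
// tree i has ID 3*i, and its two leaf children have IDs 3*i+1 and 3*i+2.
struct HIERARCHYSKETCH
{
    unsigned long _cAttribute; // Number of output attributes (trees)

    bool BValidID(unsigned long iID) const
        { return iID < 3 * _cAttribute; }

    bool BLeaf(unsigned long iID) const
        { return (iID % 3) != 0; }

    // Parent of a leaf is the root of its tree; roots have no parent.
    unsigned long IIDParent(unsigned long iID) const
        { return BLeaf(iID) ? iID - (iID % 3) : (unsigned long) -1; }

    unsigned long CChild(unsigned long iID) const
        { return BLeaf(iID) ? 0 : 2; }

    unsigned long IIDChild(unsigned long iID, unsigned long iChild) const
        { return iID + 1 + iChild; }

    // Root of the next tree after the one containing iID, or -1 if none.
    unsigned long IIDRootNext(unsigned long iID) const
    {
        unsigned long iIDRoot = iID - (iID % 3) + 3;
        return (iIDRoot < 3 * _cAttribute) ? iIDRoot : (unsigned long) -1;
    }
};
```

With all of the tree structure captured in one place like this, each navigation function reduces to a one- or two-line call, which is exactly what you will see below.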

  1. MoveToNextTree: This function sets _iIDNode to be the root of the next tree in the sequence. Note that if we are in browser mode, there will never be a next tree. Replace the entire body of the function with the following:

    ULONG iIDNodeNext = _lrphierarchy.IIDRootNext(_iIDNode);

    if (iIDNodeNext == (ULONG) -1)
    {
       return S_FALSE;
    }

    _iIDNode = iIDNodeNext;

    return S_OK;
    
  2. GetNodeID: Replace the -1 in the stub code with _iIDNode:

    *out_pNodeID = (DM_DAGNodeID) _iIDNode;
    
  3. LocateNode: Add the following:

    _iIDNode = (ULONG) in_NodeID;
    
    
  4. ValidateNodeID: Replace the function body with the following code:

    if (_lrphierarchy.BValidID((ULONG) in_NodeID))
    {
       return S_OK;
    }
    else
    {
       return E_FAIL;
    }
    
  5. GetParentCount: Every node in both hierarchies is either a root or a leaf. Thus, replace the existing code with:

    if (_lrphierarchy.BLeaf(_iIDNode))
    {
       *out_pulParents = 1;
    }
    else
    {
       *out_pulParents = 0;
    }

    return S_OK;
    
  6. MoveToParent: Add the following:

       _iIDNode = _lrphierarchy.IIDParent(_iIDNode);
    
  7. GetParentNodeID: Add the following:

       *out_pNodeID = _lrphierarchy.IIDParent(_iIDNode);
    
  8. GetChildCount: Replace the assignment of 0 with:

    *out_pulChild = _lrphierarchy.CChild(_iIDNode);
    
  9. MoveToChild: Add the following:

    _iIDNode = _lrphierarchy.IIDChild(_iIDNode, in_ChildIndex);
    
  10. GetChildNodeID: Replace the assignment of -1 with:

          *out_pNodeID = _lrphierarchy.IIDChild(_iIDNode, in_ChildIndex);
    
  11. MoveToNextLeaf: Replace the body of the function with:

    ULONG iIDLeaf = _lrphierarchy.IIDLeafNext(_iIDNode);

    if (iIDLeaf == (ULONG) -1)
    {
       return S_FALSE;
    }

    _iIDNode = iIDLeaf;

    return S_OK;
    
  12. AddRefNodeID: Leave as-is.

  13. ReleaseNodeID: Leave as-is.

  14. GetNodeProperty: For this function, replace with the implementation from Customize.cpp. The function is used to return one of eleven scalar properties associated with the current node in the model. These properties include things such as the type and description of the node. As you can see from the code, the customized function is simply a switch statement that populates a VARIANT based on the desired property.

  15. GetNodeArrayProperty: This function is very similar to the previous one, except that the desired properties are returned as arrays. As before, you should copy the function from Customize.cpp. There are only two array properties: the set of attributes corresponding to the current node, and an array of distribution elements corresponding to the current node.

  16. GetNodeUniqueName: Replace the function body with the following:

    return GetUniqueNameFromNodeID(in_pContext,
                                   _iIDNode,
                                   io_pstrUniqueName);
    
  17. GetNodeIDFromUniqueName: Add the following:

    CComPtr<IDMStringHandler> spidmstringhandler;

    HRESULT hr = in_pContext->GetStringHandler(&spidmstringhandler);

    RETURN_ON_FAIL(hr);

    const WCHAR* szName = NULL;
    UINT cch;

    hr = spidmstringhandler->GetConstStringFromHandle(in_pstrUniqueName,
                                                      &szName,
                                                      &cch);

    RETURN_ON_FAIL(hr);

    *out_pNodeID = _lrphierarchy.IDFromSz(szName);
    
  18. GetUniqueNameFromNodeID: Replace the function body with the following:

    // Translate from node ID to unique string name.

    WCHAR szName[21]; // Enough for models with 10^{20} nodes.

    _lrphierarchy.GetSzFromID(szName, (ULONG) in_NodeID, 20);

    CComPtr<IDMStringHandler> spidmstringhandler;

    HRESULT hr = in_pContext->GetStringHandler(&spidmstringhandler);

    RETURN_ON_FAIL(hr);

    UINT cch = (UINT) wcslen(szName);

    return spidmstringhandler->CopyBufferToHandle(out_pstrUniqueName, szName, cch);
    
    
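The translation in steps 17 and 18 amounts to formatting the node ID as a decimal string and parsing it back. A minimal sketch of that round trip, using our own helper names rather than the actual LRPHIERARCHY ones:

```cpp
#include <cassert>
#include <cwchar>

// Format a node ID as a decimal wide string, as GetSzFromID might.
// cchMax is the maximum number of characters, excluding the terminator.
void GetSzFromIDSketch(wchar_t* szName, unsigned long iID, size_t cchMax)
{
    std::swprintf(szName, cchMax + 1, L"%lu", iID);
}

// Parse the node ID back out of its unique name, as IDFromSz might.
unsigned long IDFromSzSketch(const wchar_t* szName)
{
    return std::wcstoul(szName, nullptr, 10);
}
```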

And that's it! You should be able to compile the project with no errors.

The Using the Customized Plug-in Algorithm Section

In this section, we show how to use the customized plug-in algorithm within the Business Intelligence Development Studio application. If you have not completed Section 7, you can instead compile the project in SRC\CompletedDemo.

First start (or restart) Analysis Services using the instructions in the last substep (number 9) of "Step 5: Registering the Algorithm with Analysis Services" in Section 6.

Open Business Intelligence Development Studio by clicking Start, clicking All Programs, and then clicking Microsoft SQL Server. Open the project Sample.slnbi that you created in Step 6 of Section 6. (If you have not yet closed the Business Intelligence solution, do so now; Business Intelligence Development Studio caches the names of the algorithms, so it must be restarted to reflect the changes.) If you have not completed Section 6, go to Step 6 now and follow the instructions for (a) creating a new project, (b) selecting a data source, and (c) defining a data source view.

Building a Model

To build a model, do the following:

  1. Right-click Mining Models in the Solution Explorer and choose New Mining Model.

  2. Click Next on the initial wizard page, and then Next again on the Definition Method page, leaving From existing... as the method.

  3. On the Select the Data Mining Technique page, there is a drop-down list of available algorithms to use. If you click on the drop-down arrow, the demo algorithm Pairwise Linear Regression should appear as the last member in the list. Select the new algorithm and click Next.

  4. Click Next on the next two pages (Select Data Source View and Specify Table Types) without making any changes. On the Specify the Training Data page, select Input and Predictable for all three attributes (X, Y, and Z). The page should look as follows:

    Figure 12. Configuring the Specify the Training Data page

  5. Click Next to advance to the final page, and then click Finish.

    You should now see a tree view that contains a leaf element for every column in the data (including the ID field). Right-click on the X column, and select Properties. The following window should appear (possibly docked on the bottom right-hand side of the main window):

    Figure 13. Properties window for attribute X

  6. You can use this property page to set properties on the attributes. To display the plug-in algorithm's model correctly in the decision-tree viewer, we need to add the REGRESSOR property to all of the attributes. To do this, click the edit box to the right of ModelingFlags, click the ... button that appears, select the row containing REGRESSOR, and then select its check box. Repeat this procedure for all of the attributes.

  7. Build the model by selecting Deploy Solution on the Build menu. The output window should contain the following:

    Figure 14. Output displayed while the model is built

Viewing a Model

To view the model that was built, select the Mining Model Viewer icon in the View pane. After the model loads, you will see a single-node decision tree and the following dialog, which provides the details of the model for the X attribute:

Figure 15. Details of the model for attribute X in the Mining Model Viewer

As described in Section 3, the coefficients 0.055 and -0.066 correspond to the pair-wise linear coefficients when X is regressed on Y and Z (individually), respectively. The offset 17.102 is simply the marginal mean of X, and the standard deviation shown in the equation at the bottom of the dialog is the marginal standard deviation of X.
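These numbers follow from the usual single-regressor formulas: the coefficient of X regressed on Y is cov(X, Y) / var(Y), and the offset is the marginal mean of X. A small sketch of that arithmetic (the data below are made up for illustration, not the tutorial's XYZ table):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Pair-wise linear regression of x on a single regressor y:
// coefficient = cov(x, y) / var(y).
double PairwiseCoefficient(const std::vector<double>& x,
                           const std::vector<double>& y)
{
    size_t c = x.size();
    double dblMeanX = 0.0, dblMeanY = 0.0;
    for (size_t i = 0; i < c; ++i) { dblMeanX += x[i]; dblMeanY += y[i]; }
    dblMeanX /= c;
    dblMeanY /= c;

    double dblCov = 0.0, dblVarY = 0.0;
    for (size_t i = 0; i < c; ++i)
    {
        dblCov  += (x[i] - dblMeanX) * (y[i] - dblMeanY);
        dblVarY += (y[i] - dblMeanY) * (y[i] - dblMeanY);
    }
    return dblCov / dblVarY;
}
```

The prediction for a case is then the marginal mean of X plus the coefficient times the regressor's deviation from its own mean.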

You can view the pair-wise coefficients for the other attributes by selecting those attributes from the Tree drop-down list.

Making Predictions

To make predictions with the model, select the Mining Model Prediction icon in the View pane. We will compare the model predictions to the values contained in the training data: In the Select Input Table(s) dialog, click Select case table, expand the Sample node, select the XYZ table, and click OK.

Drag X from the Mining Model dialog into the Source column in the top row of the table. In the Alias field of the top row, type Predicted X. Now drag X from the Select Input Table(s) dialog into the Source column in the second row of the table, and set the corresponding Alias field to Actual X. The list should now look as follows:

Figure 16. Configuring Mining Model Prediction

Click on the table icon above the Mining Model dialog, and you will see the following table:

Figure 17. Mining Model Prediction table for attribute X

The first column is obtained by calling the ALGORITHM::Predict function that we implemented!

Setting Parameters

Select the Mining Models icon in the View list, right-click the XYZ column (the column on the right) in the table that appears, and select Properties. The properties window should now look as follows:

Figure 18. Properties window for XYZ

You can set the model-specific parameters we defined by clicking the Set Algorithm Parameters box and then clicking the resulting ... button. The following dialog will appear:

Figure 19. Algorithm Parameters dialog box

In the Value column for the DISPLAY_CORRELATION parameter, type TRUE and then click OK. Rebuild the model by selecting Deploy Solution under the Build menu.

Click the Mining Model Viewer icon in the View pane, and then click the Refresh icon to the right of the (top-most) Mining Model drop-down list. Now the coefficients are set to be the pair-wise correlations instead of the regression coefficients. The dialog for the X attribute in the viewer should look as follows:

Figure 20. Browser display for attribute X using pair-wise correlation coefficients

Conclusion

And that's all there is to it! Now that you've gone through the process of constructing a customized plug-in algorithm, you can get to work perfecting your own algorithms, instead of worrying about integration with Analysis Services.