Querying a Logistic Regression Model (Analysis Services - Data Mining)
When you create a query against a data mining model, you can create a content query, which provides details about the patterns discovered in analysis, or you can create a prediction query, which uses the patterns in the model to make predictions using new data.
This section explains how to create queries for models that are based on the Microsoft Logistic Regression algorithm.
Logistic regression models are created by using the Microsoft Neural Network algorithm with a special set of parameters; therefore, a logistic regression model has some of the same information as a neural networks model, but is less complex. To understand the structure of the model content, and which node types store what kind of information, see Mining Model Content for Logistic Regression Models (Analysis Services - Data Mining).
To follow along in the query scenarios, you can create a logistic regression model as described in the following section of the Intermediate Data Mining Tutorial: Lesson 5: Building Neural Network and Logistic Regression Models (Intermediate Data Mining Tutorial).
You can also use the mining structure, Targeted Mailing, from the Basic Data Mining Tutorial.
ALTER MINING STRUCTURE [Targeted Mailing] ADD MINING MODEL [TM_Logistic Regression] ([Customer Key], [Age], [Bike Buyer] PREDICT, [Yearly Income] PREDICT, [Commute Distance], [English Education], Gender, [House Owner Flag], [Marital Status], [Number Cars Owned], [Number Children At Home], [Region], [Total Children] ) USING Microsoft_Logistic_Regression
Sample Query 1: Retrieving Model Parameters by Using the Data Mining Schema Rowset
By querying the data mining schema rowset, you can find metadata about the model, such as when it was created, when the model was last processed, the name of the mining structure that the model is based on, and the name of the column used as the predictable attribute. The following example returns the parameters that were used when the model was first created, together with the name and type of the model, and the date that it was created.
SELECT MODEL_NAME, SERVICE_NAME, DATE_CREATED, MINING_PARAMETERS FROM $system.DMSCHEMA_MINING_MODELS WHERE MODEL_NAME = 'Call Center_LR'
HOLDOUT_PERCENTAGE=30, HOLDOUT_SEED=1, MAXIMUM_INPUT_ATTRIBUTES=255, MAXIMUM_OUTPUT_ATTRIBUTES=255, MAXIMUM_STATES=100, SAMPLE_SIZE=10000
Sample Query 2: Finding Additional Detail about the Model by Using DMX
The following query returns some basic information about the logistic regression model. A logistic regression model is similar to a neural network model in many ways, including the presence of a marginal statistic node (NODE_TYPE = 24) that describes the values used as inputs. This example query uses the Targeted Mailing model, and gets the values of all the inputs by retrieving them from the nested table, NODE_DISTRIBUTION.
SELECT FLATTENED NODE_DISTRIBUTION AS t FROM [TM_Logistic Regression].CONTENT
The actual query returns many more rows; however, this sample illustrates the type of information that is provided about the inputs. For discrete inputs, each possible value is listed in the table. For continuous-value inputs such as Age, a complete listing is impossible, so the input is discretized as a mean. For more information about how to use the information in the marginal statistics node, see Mining Model Content for Logistic Regression Models (Analysis Services - Data Mining).
The results have been flattened for easier viewing, but you can return the nested table in a single column if your provider supports hierarchical rowsets. For more information, see Hierarchical Rowsets in the OLE DB Programmer's Guide.
You can use the Predict (DMX) function with every kind of mining model to provide new data to the model and make predictions based on the new values. You can also use functions to return additional information about the prediction, such as the probability that a prediction is correct. This section provides some examples of prediction queries on a logistic regression model.
Sample Query 3: Making Predictions for a Continuous Value
Because logistic regression supports the use of continuous attributes for both input and prediction, it is easy to create models that correlate various factors in your data. You can use prediction queries to explore the relationship among these factors.
The following query sample is based on the Call Center model, from the Intermediate Tutorial, and creates a singleton query that predicts service grade for the Friday AM shift. The PredictHistogram (DMX) function returns a nested table that provides statistics relevant to understanding the validity of the predicted value.
SELECT Predict([Call Center_LR].[Service Grade]) as Predicted ServiceGrade, PredictHistogram([Call Center_LR].[Service Grade]) as [Results], FROM [Call Center_LR] NATURAL PREDICTION JOIN (SELECT 'Friday' AS [Day Of Week], 'AM' AS [Shift]) AS t
Predicted Service Grade
For more information about the probability, support, and standard deviation values in the nested NODE_DISTRIBUTION table, see Mining Model Content for Logistic Regression Models (Analysis Services - Data Mining).
Sample Query 4: Making Predictions for a Discrete Value
Logistic regression is typically used in scenarios where you want to analyze the factors that contribute to a binary outcome. Although the model used in the tutorial predicts a continuous value, ServiceGrade, in a real-life scenario you might want to set up the model to predict whether service grade met some discretized target value. Alternatively, you could output the predictions using a continuous value but later group the predicted outcomes into Good, Fair, or Poor.
The following sample illustrates how to change the way that the predictable attribute is grouped. To do this, you create a copy of the mining structure and then change the discretization method of the target column so that the values are grouped rather than continuous.
The following procedure describes how to change the grouping of Service Grade values in the Call Center data.
To create a discretized version of the Call Center mining structure and models
In Business Intelligence Development Studio, in Solution Explorer, expand Mining Structures.
Right-click Call Center.dmm and select Copy.
Right click Mining Structures and select Paste. A new mining structure iss added, named Call Center 1.
Right-click the new mining structure and select Rename. Type the new name, Call Center Discretized.
Double-click the new mining structure to open it in the designer. Notice that the mining models have all been copied as well, and all have the extension 1. Leave the names as is for now.
In the Mining Structure tab, right-click the column for Service Grade, and select Properties.
Change the Content property from Continuous to Discretized. Change the DiscretizationMethod property to Clusters. For Discretization BucketCount, type 3.
These parameters are just used for illustrating the process, and do not necessarily produce a valid model,
From the Mining Model menu, select Process structure and all models.
The following sample query is based on this discretized model, and predicts the service grade for the specified day of the week, together with the probabilities for each predicted outcome.
SELECT (PredictHistogram([Call Center_LR 1].[Service Grade])) as [Predictions] FROM [Call Center_LR 1] NATURAL PREDICTION JOIN (SELECT 'Saturday' AS [Day Of Week]) AS t
Note that the predicted outcomes have been grouped into three categories as specified; however, these groupings are based on the clustering of actual values in the data, not arbitrary values that you might set as business goals.
All Microsoft algorithms support a common set of functions. However, the Microsoft Logistic Regression algorithm supports the additional functions listed in the following table.
For a list of the functions that are common to all Microsoft algorithms, see Mapping Functions to Query Types (DMX). For the syntax of specific functions, see Data Mining Extensions (DMX) Function Reference.
For neural network and logistic regression models, the PredictSupport (DMX) function returns a single value that represents the size of the training set for the entire model.