Querying Data Mining Models (Analysis Services - Data Mining)
After you have trained a data mining model, you can explore the model by using the custom viewers provided in SQL Server Management Studio or Business Intelligence Development Studio. However, if you want to make predictions or obtain advanced or more specific information from the model, you must create a query against the data mining model. Queries help you better understand and work with the information in the model in the following ways:
Making single and batch predictions.
Learning more about the patterns found by the model.
Viewing details or the training cases for a particular pattern or subset of the model.
Drilling through to details of the cases in the mining model.
Extracting formulas, rules, or statistics about all or a subset of the model and data.
SQL Server Analysis Services provides a graphical design interface for creating queries, and also a query language called Data Mining Extensions (DMX) that is useful for creating custom predictions and complex queries. To build DMX prediction queries, you can start with the query builders that are available in both SQL Server Management Studio and Business Intelligence Development Studio. A set of DMX query templates is also provided in SQL Server Management Studio.
For more information about how to use the query builder, see Using the Prediction Query Builder to Create DMX Prediction Queries.
For more information about how to use DMX query templates, see Creating DMX Queries in SQL Server Management Studio or How to: Use Templates in SQL Server Management Studio.
This section describes the types of queries that you can create, and the information that each type of query provides. This section also includes examples of queries for specific types of data mining models.
The main goal of many data mining projects is to use the mining models to make predictions. For example, you might want to predict how many products your company will sell next year during December, or whether a potential customer will purchase a product in response to an advertising campaign.
When you create a prediction, you typically provide some piece of new data and ask the model to generate a prediction based on the new data. You can make predictions in a batch, by mapping the model to an external source of data in a prediction join. Alternatively, you can provide values one at a time, by creating a singleton query.
Both singleton and batch prediction queries use the PREDICTION JOIN syntax to define the new data: the difference is in how the input side of the prediction join is specified. In a singleton query, the data is supplied inline as part of the query. In a batch query, the data comes from an external data source that is specified by using the OPENQUERY syntax. For more information, see OPENQUERY (DMX).
In addition, time series models allow you to make predictions based on just the model—you do not need to provide any new data, but just request predictions based on the existing series.
The following section provides information about how to create each of these query types:
Singleton Prediction Query
Make prediction about a single new case, or multiple new cases, that you type into the query.
Map new cases in an external data source to the model and make predictions.
Time Series Predictions
Predict a specified number of future steps, based on an existing model.
Extend an existing model by adding new data, and make predictions based on the composite series.
Apply existing model to new data series by using the REPLACE_MODEL_CASES option.
In addition to predicting a specific value, such as the amount of sales next month, or a product to recommend to a customer, you can customize a prediction query to return various kinds of information that are related to the prediction. For example, it might be helpful to know the probability that the prediction is correct, so that you can decide whether to present the recommendation to the user or not.
To customize the information that is returned by a prediction query, you add prediction functions to the query. Each model or query type supports specific functions. For example, clustering models support special prediction functions that provide extra detail about the groupings created by the model. Time series models support prediction functions that calculate differences over time. There are also general prediction functions that work with almost all model types. For a list of the prediction functions supported in different types of queries, see Mapping Functions to Query Types (DMX). For a complete list of prediction functions, see Data Mining Extensions (DMX) Function Reference.
A singleton query is useful when you want to create simple predictions in real time. For example, you might obtain information from a customer by using a Web site, and then use that data to return predictions, presented as recommendations specific to that customer. Or, you might analyze the contents of an e-mail and then use an existing classification model to assign a category to the e-mail and route the message accordingly.
Singleton queries do not require a separate table that contains input. Instead, you pass a single row of data into the model, and a single prediction is returned in real time. You can also extend the singleton query to make multiple predictions by typing a SELECT statement that contains a single case, adding the UNION operator, and then typing another SELECT statement to specify another case.
You can create singleton queries against an existing model in the following ways:
Using the Data Mining Designer.
Using a singleton query template.
Creating a DMX statement programmatically or in another Analysis Services client.
When you create a singleton query, you must provide the new data to the model in the form of a PREDICTION JOIN. This means that even though you are not mapping to an actual table, you must make sure that the new data matches the existing columns in the mining model. If the new data columns and the new data match exactly, Analysis Services will map the columns for you. This is called a NATURAL PREDICTION JOIN. However, if the columns do not match, or if the new data does not contain the same kind and amount of data that is in the model, you must specify which columns in the model map to the new data, or specify the missing values.
For information about how to create a singleton query in Data Mining Designer, see How to: Create a Singleton Query in the Data Mining Designer and Using the Prediction Query Builder to Create DMX Prediction Queries.
For information about using DMX to create a singleton query, see Prediction Queries (DMX).
For an example of how to use the DMX Query templates in SQL Server Management Studio, see How to: Create a Singleton Prediction Query from a Template.
Batch Prediction Queries
When you perform a prediction join, you map the model to a new data source, and Analysis Services then makes predictions for each row in the new data based on patterns in the model. A prediction join is useful when you have lots of information in a table or other external data source and want to make predictions by using the trained model.
You can create batch prediction queries against an existing model in the following ways:
Using the Data Mining Designer.
Using a template.
Creating a DMX statement programmatically or in another Analysis Services client.
If you create a batch prediction query by using the Data Mining Designer, the external data source must first be defined as a data source view.
If you use DMX to create a prediction join, you can specify the external data source by using the OPENQUERY, OPENROWSET, or SHAPE commands. The default data access method in the DMX templates is OPENQUERY. For information about these methods, see <source data query>.
Regardless of how you define the external data, the source of data that you specify must contain columns with data that is like the data in the model. However, the new information can be incomplete. For example, your customer list might have an Age column but no information about income. Even if income data was used when training the model, you can still map the new data to the model and create a prediction; however, in some cases, the lack of complete information might affect the quality of predictions.
To get the best results, you should join as many of the matching columns as possible between the new data and the model. However, the query will succeed even if there are no matches. If no columns are joined, the query will return the marginal prediction, which is equivalent to the statement SELECT <predictable-column> FROM <model> without a PREDICTION JOIN clause.
Working with the Results of a Prediction Query
A prediction query is not like a query on a relational database. Each prediction function that you add to a query returns its own rowset. Therefore, when you make a prediction on a single case, the result might be a predicted value together with several columns of nested tables containing additional detail.
Whenever you combine multiple functions in one query, the return results are combined as a hierarchical rowset. However, if your provider cannot handle hierarchical rowsets, you can flatten the results by using the FLATTEN keyword in the prediction query.
For more information, including examples of flattened rowsets, see SELECT (DMX).
Predictions in Time Series Mining Models
Time series models provide more flexibility in the way that you can use new data and create predictions. You can either use the model as it is to create predictions, or you can provide new data to the model to update predictions based on recent trends. If you add new data, you can specify the way the new data should be used, either to extend the model cases, or replace the model cases.
When you extend the model cases, you add the new fact data to the time series model and further predictions are based on the new, combined series. When you replace the model cases, you keep the trained model, but replace the underlying cases with a new set of case data.
Regardless of which approach you use, the starting point for predictions is always the end of the original series.
For example, suppose that you have an existing time series model that has been trained on the sales data from the previous year. After you have collected several months of new sales data, you decide to update your sales forecasts for the current year. You can create a prediction join that updates the model by adding new data and extends the model to make new predictions.
Alternatively, you could build your model based on existing data, and then create a prediction join that replaces the case data with new data. This is useful if, for example, one store is missing a lot of data and you want to use a model built on other store data to make predictions. For more information about how to create prediction joins on time series models, see Querying Time Series Models or PredictTimeSeries (DMX).
A content query is a way of extracting information about the internal statistics and structure of the mining model. Sometimes a content query can provide details that are not readily available in the viewer. You can also use the results of a content query to extract information programmatically for other uses. For example, you can extract formulas or probabilities for making your own calculations, or retrieve actionable information about the cases in a model.
This section provides general information about the types of information that you can retrieve by using a content query. Content queries use the DMX syntax shown in the following table:
SELECT FROM <model>.CASES
Find the cases that were used to train or test a model. Drill through into data, including columns in the underlying mining structure.
SELECT FROM <structure>.CASES
View all the data that is included in the structure, including columns that are not included in a particular mining model.
SELECT FROM <model>.CONTENT
Retrieve detailed information about specific nodes in the model, including rules and formulas, support and variance statistics, and so forth.
SELECT FROM <model>.DIMENSIONCONTENT
Supports queries on a data mining dimension.
This query type is principally for internal use. If you develop your own plug-in algorithm, you might use this syntax to test your models.
Not all algorithms support this functionality. Support is indicated by a flag in the MINING_SERVICES schema rowset.
Some model content is standard across algorithms. Generally, however, the content of each model depends on the algorithm that you used to build the model. Therefore, when you create a content query, you must understand what types of information in the model are the most useful.
For example, if you create a query that uses the syntax, SELECT FROM <model>.CONTENT, the query returns very different information depending on whether the model is a sequence clustering model, a decision trees model, or a time series model. For an association model, you might want to retrieve the descriptions of specific rules so that you can use the rules in a custom application, whereas in a time series or sequence clustering model, you might want to find more information about the time patterns detected by the model.
Some examples are provided in the following sections to illustrate the breadth and depth of the information that you can get from a content query; however, for information about mining model content, and the content that is specific to each model type, see Mining Model Content (Analysis Services - Data Mining).
Example 1: Content Query on an Association Model
You can use a SELECT FROM <model>.CONTENT statement to return different kinds of content, depending on the type of model you are querying. For an association model, nodes that represent rules have a NODE_TYPE value of 8, whereas itemsets have a NODE_TYPE value of 7. Thus, the following query would return the top 10 itemsets, ranked by support (the default ordering).
SELECT TOP 10 NODE_DESCRIPTION, NODE_PROBABILITY, SUPPORT FROM <model>.CONTENT WHERE NODE_TYPE = 7
The following example returns three columns: the ID of the node, the complete rule, and the product on the right-hand side of the itemset—that is, the product that is predicted to be associated with some other products as part of an itemset.
The FLATTENED keyword indicates that the nested rowset should be converted into a flat table. The attribute that represents the product on the right-hand side of the rule is contained in the NODE_DISTRIBUTION table; therefore, we retrieve only the row that contains an attribute name by adding a requirement that the length be greater than 2. A simple string function is used to remove the name of the model from the third column. Usually the model name is always prefixed to the values of nested columns. The WHERE clause specifies that the value of NODE_TYPE should be 8, to retrieve only rules.
SELECT FLATTENED NODE_UNIQUE_NAME , NODE_DESCRIPTION, (SELECT RIGHT(ATTRIBUTE_NAME, (LEN(ATTRIBUTE_NAME)-LEN('Association model name'))) FROM NODE_DISTRIBUTION WHERE LEN(ATTRIBUTE_NAME)>2 ) AS RightSideProduct FROM [<Association model name>].CONTENT WHERE NODE_TYPE = 8 ORDER BY NODE_SUPPORT DESC
For more examples, see Querying an Association Model (Analysis Services - Data Mining).
Example 2: Content Query on a Decision Trees Model
One scenario where querying the model content is valuable is when you want to follow up on a prediction query by returning the rule that explains why that state was predicted. For example, you can add the prediction function PredictNodeId (DMX) to a query to obtain the ID of the node that contains the rule, by using the following syntax:
SELECT Predict([Bike Buyer]), PredictNodeID([Bike Buyer]) FROM [<decision tree model name>] PREDICTION JOIN <input rowset>
For a decision tree model, the caption contains the description of the path to the outcome. Therefore, once you have the ID of the node that contains the outcome, you can retrieve the rule or path that explains the prediction by creating a content query, such as the following:
SELECT NODE_CAPTION FROM [<decision tree model name>] WHERE NODE_UNIQUE_NAME= '<node id>'
For more examples, see Querying a Decision Trees Model (Analysis Services - Data Mining).
Analysis Services provides a variety of data definition statements for creating and managing mining structure and models. For more information, see Managing Data Mining Structures and Models.