Content Queries (Data Mining)
A content query is a way of extracting information about the internal statistics and structure of the mining model. Sometimes a content query can provide details that are not readily available in the viewer. You can also use the results of a content query to extract information programmatically for other uses.
This section provides general information about the types of information that you can retrieve by using a content query, and the general DMX syntax for content queries.
You can create content queries by using the Prediction Query Builder, use the DMX content query templates that are provided in SQL Server Management Studio, or write queries directly in DMX. Unlike prediction queries you do not need to join external data, so content queries are easy to write.
This section provides an overview of the types of content queries you can create.
Queries on the mining structure or case data let you view the detailed data that was used for training.
Queries on the model can return patterns, lists of attributes, formulas, and so forth.
Queries on Structure and Case Data
DMX supports queries on the cached data that is used to build mining structures and models. By default, this cache is created when you define the mining structure, and is populated when you process the structure or model.
This cache cannot be cleared or deleted if you need to separate data into training and testing sets. If the cache is cleared, you cannot query the case data.
The following examples show the common patterns for creating queries on the case data, or queries on the data in the mining structure:
Queries on Model Patterns, Statistics, and Attributes
The content of a data mining model is useful for many purposes. With a model content query, you can:
Extract formulas or probabilities for making your own calculations.
For an association model, retrieve the rules that are used to generate a prediction.
Retrieve the descriptions of specific rules so that you can use the rules in a custom application.
View the moving averages detected by a time series model.
Obtain the regression formula for some segment of the trend line.
Retrieve actionable information about customers identified as being part of a specific cluster.
The following examples show some of the common patterns for creating queries on the model content:
Although some model content is standard across algorithms, some parts of the content vary greatly depending on the algorithm that you used to build the model. Therefore, when you create a content query, you must understand what information in the model is most useful to your specific model.
A few examples are provided in this section to illustrate how the choice of algorithm affects the kind of information that is stored in the model. For more information about mining model content, and the content that is specific to each model type, see Mining Model Content (Analysis Services - Data Mining).
Example 1: Content Query on an Association Model
The statement, SELECT FROM <model>.CONTENT, returns different kinds of information, depending on the type of model you are querying. For an association model, a key piece of information is the node type. Nodes are like containers for information in the model content. In an association model, nodes that represent rules have a NODE_TYPE value of 8, whereas nodes that represent itemsets have a NODE_TYPE value of 7.
Thus, the following query returns the top 10 itemsets, ranked by support (the default ordering).
SELECT TOP 10 NODE_DESCRIPTION, NODE_PROBABILITY, SUPPORT FROM <model>.CONTENT WHERE NODE_TYPE = 7
The following query builds on this information. The query returns three columns: the ID of the node, the complete rule, and the product on the right-hand side of the itemset—that is, the product that is predicted to be associated with some other products as part of an itemset.
SELECT FLATTENED NODE_UNIQUE_NAME, NODE_DESCRIPTION, (SELECT RIGHT(ATTRIBUTE_NAME, (LEN(ATTRIBUTE_NAME)-LEN('Association model name'))) FROM NODE_DISTRIBUTION WHERE LEN(ATTRIBUTE_NAME)>2 ) AS RightSideProduct FROM [<Association model name>].CONTENT WHERE NODE_TYPE = 8 ORDER BY NODE_SUPPORT DESC
The FLATTENED keyword indicates that the nested rowset should be converted into a flattened table. The attribute that represents the product on the right-hand side of the rule is contained in the NODE_DISTRIBUTION table; therefore, we retrieve only the row that contains an attribute name, by adding a requirement that the length be greater than 2.
A simple string function is used to remove the name of the model from the third column. (Usually the model name is prefixed to the values of nested columns.)
The WHERE clause specifies that the value of NODE_TYPE should be 8, to retrieve only rules.
For more examples, see Association Model Query Examples.
Example 2: Content Query on a Decision Trees Model
A decision tree model can be used for prediction as well as for classification. This example assumes that you are using the model to predict an outcome, but you also want to find out which factors or rules can be used to classify the outcome.
In a decision tree model, nodes are used to represent both trees and leaf nodes. The caption for each node contains the description of the path to the outcome. Therefore, to trace the path for any particular outcome, you need to identify the node that contains it, and get the details for that node.
In your prediction query, you add the prediction function PredictNodeId (DMX), to get the ID of the related node, as shown in the following example:
SELECT Predict([Bike Buyer]), PredictNodeID([Bike Buyer]) FROM [<decision tree model name>] PREDICTION JOIN <input rowset>
Once you have the ID of the node that contains the outcome, you can retrieve the rule or path that explains the prediction by creating a content query that includes the NODE_CAPTION, as follows:
SELECT NODE_CAPTION FROM [<decision tree model name>] WHERE NODE_UNIQUE_NAME= '<node id>'
For more examples, see Decision Trees Model Query Examples.
As the examples demonstrate, content queries mostly return tabular rowsets, but can also include information from nested columns. You can flatten the rowset that is returned, but this can make working with results more complex. The content of the NODE_DISTRIBUTION node in particular is nested, but contains much interesting information about the model.
For more information about how to work with hierarchical rowsets, see the OLEDB specification on MSDN.