Mining Model Content (Analysis Services - Data Mining)
After you have designed and processed a mining model using data from the underlying mining structure, the mining model is complete and contains mining model content. You can use this content to make predictions or analyze your data.
Mining model content includes metadata about the model, statistics about the data, and patterns discovered by the mining algorithm. Depending on the algorithm that was used, the model content may include regression formulas, the definitions of rules and itemsets, or weights and other statistics.
Regardless of the algorithm that was used, mining model content is presented in a standard structure. You can browse the structure in the Microsoft Generic Content Tree Viewer, provided in Business Intelligence Development Studio, and then switch to one of the custom viewers to see how the information is interpreted and displayed graphically for each model type. You can also create queries against the mining model content by using any client that supports the MINING_MODEL_CONTENT schema rowset. For more information, see Querying Data Mining Models: How-to Topics (Analysis Services - Data Mining).
This section describes the basic structure of the content provided for all kinds of mining models. It describes the node types that are common to all mining model content, and provides guidance on how to interpret the information.
The content of each model is presented as a series of nodes. A node is an object within a mining model that contains metadata and information about a portion of the model. Nodes are arranged in a hierarchy. The exact arrangement of nodes in the hierarchy, and the meaning of the hierarchy, depends on the algorithm that you used. For example, if you create a decision trees model, the model can contain multiple trees, all connected to the model root; if you create a neural network model, the model may contain one or more networks, plus a statistics node.
The first node in each model is called the root node, or the model parent node. Every model has a root node (NODE_TYPE = 1). The root node typically contains some metadata about the model, and the number of child nodes, but little additional information about the patterns discovered by the model.
Depending on which algorithm you used to create the model, the root node has a varying number of child nodes. Child nodes have different meanings and contain different content, depending on the algorithm and the depth and complexity of the data.
In a mining model, a node is a general-purpose container that stores a piece of information about all or part of the model. The structure of each node is always the same, and contains the columns defined by the data mining schema rowset. For more information, see DMSCHEMA_MINING_MODEL_CONTENT Rowset.
Each node includes metadata about the node, including an identifier that is unique within each model, the ID of the parent node, and the number of child nodes that the node has. The metadata identifies the model to which the node belongs, and the database catalog where that particular model is stored. Additional content provided in the node differs depending on the type of algorithm you used to create the model, and might include the following:
Count of cases in the training data that supports a particular predicted value.
Statistics, such as mean, standard deviation, or variance.
Coefficients and formulas.
Definition of rules and lateral pointers.
XML fragments that describe a portion of the model.
List of Node Types
The following table lists the different types of nodes that are output in data mining models. Because each algorithm processes information differently, each model generates only a few specific kinds of nodes. If you change the algorithm, the type of nodes may change. Also, if you reprocess the model, the content of each node may change.
If you use a different data mining service than the ones provided in SQL Server 2008 Analysis Services, or if you create your own plug-in algorithms, additional custom node types may be available.
Metadata and root content node. Applies to all model types.
Root node of a classification tree. Applies to decision tree models.
Interior split node in a tree. Applies to decision tree models.
Terminal node of a tree. Applies to decision tree models.
Cluster detected by the algorithm. Applies to clustering models and sequence clustering models.
Unknown node type.
Itemset detected by the algorithm. Applies to association models or sequence clustering models.
Association rule detected by the algorithm. Applies to association models or sequence clustering models.
Predictable attribute. Applies to all model types.
Input attribute. Applies to decision trees and Naïve Bayes models.
Statistics about the states of an input attribute. Applies to decision trees and Naïve Bayes models.
Top node for a Markov model component of a sequence cluster. Applies to sequence clustering models.
Markov transition matrix. Applies to sequence clustering models.
Non-root node of a time series tree. Applies to time series models only.
Root node of a time series tree that corresponds to a predictable time series. Applies to time series models, and only if the model was created using the MIXED parameter.
One sub-network. Applies to neural network models.
Group that contains the nodes of the input layer. Applies to neural network models.
Groups that contains the nodes that describe the hidden layer. Applies to neural network models.
Groups that contains the nodes of the output layer. Applies to neural network models.
Node in the input layer that matches an input attribute with the corresponding states. Applies to neural network models.
Node in the hidden layer. Applies to neural network models.
Node in the output layer. This node will usually match an output attribute and the corresponding states. Applies to neural network models.
Marginal statistics about the training set. Applies to neural network models.
Root of a regression tree. Applies to linear regression models and to decision trees models that contains continuous input attributes.
Marginal statistics about the training set. Applies to Naïve Bayes models.
Root node of an ARIMA model. Applies to only those time series models that use the ARIMA algorithm.
A periodic structure in an ARIMA model. Applies to only those time series models that use the ARIMA algorithm.
Autoregressive coefficient for a single term in an ARIMA model.
Applies to only those time series models that use the ARIMA algorithm.
Moving average coefficient for a single term in an ARIMA model. Applies to only those time series models that use the ARIMA algorithm.
Starting point for custom node types. Custom node types must be integers greater in value than this constant. Applies to models created by using custom plug-in algorithms.
Node ID, Name, Caption and Description
The root node of any model always has the unique ID (NODE_UNIQUE_NAME) of 0. All node IDs are assigned automatically by Analysis Services and cannot be modified.
The root node for each model also contains some basic metadata about the model. This metadata includes the Analysis Services database where the model is stored (MODEL_CATALOG), the schema (MODEL_SCHEMA), and the name of the model (MODEL_NAME). However, this information is repeated in all the nodes of the model, so you do not need to query the root node to get this metadata.
In addition to a name used as the unique identifier, each node has a name (NODE_NAME). This name is automatically created by the algorithm for display purposes and cannot be edited.
The Microsoft Clustering algorithm allows users to assign friendly names to each cluster. However, these friendly names are not persisted on the server, and if you reprocess the model, the algorithm will generate new cluster names.
The caption and description for each node are automatically generated by the algorithm, and serve as labels to help you understand the content of the node. The text generated for each field depends on the model type. In some cases, the name, caption, and description may contain exactly the same string, but in some models, the description may contain additional information. See the topic about the individual model type for details of the implementation.
Analysis Services server supports the renaming of nodes only if you build models by using a custom plug-in algorithm that implements renaming,. To enable renaming, you must override the methods when you create the plug-in algorithm.
Node Parents, Node Children, and Node Cardinality
The relationship between parent and child nodes in a tree structure is determined by the value of the PARENT_UNIQUE_NAME column. This value is stored in the child node and tells you the ID of the parent node. Some examples follow of how this information might be used:
A PARENT_UNIQUE_NAME that is NULL means that the node is the top node of the model.
If the value of PARENT_UNIQUE_NAME is 0, the node must be a direct descendant of the top node in the model. This is because the ID of the root node is always 0.
You can use functions within a Data Mining Extensions (DMX) query to find descendants or parents of a particular node. For more information about using functions in queries, see Querying Data Mining Models (Analysis Services - Data Mining).
Cardinality refers to the number of items in a set. In the context of a processed mining model, cardinality tells you the number of children in a particular node. For example, if a decision tree model has a node for [Yearly Income], and that node has two child nodes, one for the condition [Yearly Income] = High and one for the condition, [Yearly Income] = Low, the value of CHILDREN_CARDINALITY for the [Yearly Income] node would be 2.
In Analysis Services, only the immediate child nodes are counted when calculating the cardinality of a node. However, if you create a custom plug-in algorithm, you can overload CHILDREN_CARDINALITY to count cardinality differently. This may be useful, for example, if you wanted to count the total number of descendants, not just the immediate children.
Although cardinality is counted in the same way for all models, how you interpret or use the cardinality value differs depending on the model type. For example, in a clustering model, the cardinality of the top node tells you the total number of clusters that were found. In other types of models, cardinality may always have a set value depending on the node type. For more information about how to interpret cardinality, see the topic about the individual model type.
Some models, such as those created by the Microsoft Neural Network algorithm, additionally contain a special node type that provides descriptive statistics about the training data for the entire model. By definition, these nodes never have child nodes.
The NODE_DISTRIBUTION column contains a nested table that in many nodes provides important and detailed information about the patterns discovered by the algorithm. The exact statistics provided in this table change depending on the model type, the position of the node in the tree, and whether the predictable attribute is a continuous numeric value or a discrete value; however, they can include the minimum and maximum values of an attribute, weights assigned to values, the number of cases in a node, coefficients used in a regression formula, and statistical measures such as standard deviation and variance. For more information about how to interpret node distribution, see the topic for the specific type of model type that you are working with.
The NODE_DISTRIBUTION table may be empty, depending on the node type. For example, some nodes serve only to organize a collection of child nodes, and it is the child nodes that contain the detailed statistics.
The nested table, NODE_DISTRIBUTION, always contains the following columns. The content of each column varies depending on the model type. For more information about specific model types, see Mining Model Content by Algorithm Type.
The meaning of the node score differs depending on the model type, and can be specific to the node type as well. For information about how NODE_SCORE is calculated for each model and node type, see Mining Model Content by Algorithm Type.
Node Probability and Marginal Probability
The mining model schema rowset includes the columns NODE_PROBABILITY and MARGINAL_PROBABILITY for all model types. These columns contain values only in nodes where a probability value is meaningful. For example, the root node of a model never contains a probability score.
In those nodes that do provide probability scores, the node probability and marginal probabilities represent different calculations.
Marginal probability is the probability of reaching the node from its parent.
Node probability is the probability of reaching the node from the root.
Node probability is always less than or equal to marginal probability.
For example, if the population of all customers in a decision tree is split equally by gender (and no values are missing), the probability of the child nodes should be .5. However, suppose that each of the nodes for gender is divided equally by income levels—High, Medium, and Low. In this case the MARGINAL_PROBABILITY score for each child node should always be .33 but the NODE_PROBABILTY value will be the product of all probabilities leading to that node and thus always less than the MARGINAL_PROBABILITY value.
Level of node/attribute and value
All target customers
Target customers split by gender
Target customers split by gender, and split again three ways by income
.5 * .33 = .165
Node Rule and Marginal Rule
The mining model schema rowset also includes the columns NODE_RULE and MARGINAL_RULE for all model types. These columns contain XML fragments that can be used to serialize a model, or to represent some part of the model structure. These columns may be blank for some nodes, if a value would be meaningless.
Two kinds of XML rules are provided, similar to the two kinds of probability values. The XML fragment in MARGINAL_RULE defines the attribute and value for the current node, whereas the XML fragment in NODE_RULE describes the path to the current node from the model root.
Each algorithm stores different types of information as part of its content schema. For example, the Microsoft Clustering Algorithm generates many child nodes, each of which represents a possible cluster. Each cluster node contains rules that describe characteristics shared by items in the cluster. In contrast, the Microsoft Linear Regression algorithm does not contain any child nodes; instead, the parent node for the model contains the equation that describes the linear relationship discovered by analysis.
The following table provides links to topics for each type of algorithm.
Model content topics: Explain the meaning of each node type for each algorithm type, and provide guidance about which nodes are of most interest in a particular model type.
Querying topics: Provide examples of queries against a particular model type and guidance on how to interpret the results.
Algorithm or Model Type
Querying Mining Models
Association rules models
Decision trees model
Linear regression models
Logistic regression models
Naïve Bayes models
Neural network models
Time series models
When you browse or explore a model in Business Intelligence Development Studio, you can view the information in the Microsoft Generic Content Tree Viewer, which is available in both Business Intelligence Development Studio and SQL Server Management Studio.
The Microsoft Generic Content Viewer displays the columns, rules, properties, attributes, nodes, and other content from the model by using the same information that is available in the content schema rowset of the mining model. The content schema rowset is a generic framework for presenting detailed information about the content of a data mining model. You can view model content in any client that supports hierarchical rowsets. The viewer in Business Intelligence Development Studio presents this information in an HTML table viewer that represents all models in a consistent format, making it easier to understand the structure of the models that you create. For more information, see Viewing Model Details with the Microsoft Generic Content Tree Viewer.
To retrieve mining model content, you must create a query against the data mining model.
The easiest way to create a content query is to execute the following DMX statement in SQL Server Management Studio:
SELECT * FROM [<mining model name>].CONTENT
For more information, see Querying Data Mining Models (Analysis Services - Data Mining).
You can also query the mining model content by using the data mining schema rowsets. A schema rowset is a standard structure that clients use to discover, browse, and query information about mining structures and models. You can query the schema rowsets by using XMLA, Transact-SQL, or DMX statements.
In SQL Server 2008, you can also access the information in the data mining schema rowsets by opening a connection to the Analysis Services server and querying the system tables. For more information about using SELECT statements to query data mining schema rowsets, see Troubleshooting Tools (Analysis Services - Data Mining).