Mining Model Content for Decision Tree Models (Analysis Services - Data Mining)
This topic describes mining model content that is specific to models that use the Microsoft Decision Trees algorithm. For a general explanation of mining model content for all model types, see Mining Model Content (Analysis Services - Data Mining). It is important to remember that The Microsoft Decision Trees algorithm is a hybrid algorithm that can create models with very different functions: a decision tree can represent associations, rules, or even linear regression. The structure of the tree is essentially the same, but how you interpret the information will depend on the purpose for which you created the model.
A decision trees model has a single parent node that represents the model and its metadata. Underneath the parent node are independent trees that represent the predictable attributes that you select. For example, if you set up your decision tree model to predict whether customers will purchase something, and provide inputs for gender and income, the model would create a single tree for the purchasing attribute, with many branches that divide on conditions related to gender and income.
However, if you then add a separate predictable attribute for participation in a customer rewards program, the algorithm will create two separate trees under the parent node. One tree contains the analysis for purchasing, and another tree contains the analysis for the customer rewards program. If you use the Decision Trees algorithm to create an association model, the algorithm creates a separate tree for each product that is being predicted, and the tree contains all the other product combinations that contribute towards selection of the target attribute.
If your model includes multiple trees, you can view only one tree at a time in the Microsoft Tree Viewer. However, in the Generic Content Tree Viewer , all trees in the same model are displayed at the same time.
The tree for each predictable attribute contains information that describes how the input columns that you choose affect the outcome of that particular predictable attribute. Each tree is headed by a node (NODE_TYPE = 9) that contains the predictable attribute, followed by a series of nodes (NODE_TYPE = 10) that represent the input attributes. An attribute corresponds to either a case-level column or values of nested table columns, which are generally the values in the Key column of the nested table.
Interior and leaf nodes represent split conditions. A tree can split on the same attribute multiple times. For example, the TM_DecisionTree model might split on [Yearly Income] and [Number of Children], and then split again on [Yearly Income] further down the tree.
The Microsoft Decision Trees algorithm can also contain linear regressions in all or part of the tree. If the attribute that you are modeling is a continuous numeric data type, the model can create a regression tree node (NODE_TYPE = 25) wherever the relationship between the attributes can be modeled linearly. In this case, the node contains a regression formula.
However, if the predictable attribute has discrete values, or if numeric values have been bucketed or discretized, the model always creates a classification tree (NODE_TYPE =2). A classification tree can have multiple branches or interior tree nodes (NODE_TYPE =3) for each value of the attribute. However, the split is not necessarily on each value of the attribute.
The Microsoft Decision Trees algorithm does not allow continuous data types as inputs; therefore, if any columns have a continuous numeric data type, the values are discretized. The algorithm performs its own discretization at the point of a split for all continuous attributes.
Analysis Services automatically chooses a method for bucketing continuous attributes; however, you can control how continuous values in the inputs are discretized by setting the content type of the mining structure column to Discretized and then setting the DiscretizationBucketCount or DiscretizationMethod property.
This section provides details and examples only for those columns in the mining model content that have particular relevance for decision trees models. For information about general-purpose columns in the schema rowset, and explanations of mining model terminology, see Mining Model Content (Analysis Services - Data Mining).
A decision trees model does not have a separate node that stores statistics for the entire model, unlike the marginal statistics node found in a Naive Bayes or neural network model. Instead, the model creates a separate tree for each predictable attribute, with an (All) node at the top of the tree. Each tree is independent of the others. If your model contains only one predictable attribute, there is only one tree, and therefore only one (All) node.
Each tree that represents an output attribute is additionally subdivided into interior branches (NODE_TYPE = 3) that represent splits. Each of these trees contains statistics about the distribution of the target attribute. In addition, each leaf node (NODE_TYPE = 4) contains statistics that describe input attributes and their values, together with the number of cases in support of each attribute-value pair. Therefore, in any branch of a decision tree, you can view the probabilities or the distribution of data easily without having to query the source data. Each level of the tree necessarily represents the sum of its immediate child nodes.
For examples of how to retrieve these statistics, see Querying a Decision Trees Model (Analysis Services - Data Mining).
To understand how a decision tree works, consider an example, such as the AdventureWorks bike buyer scenario. Assuming that the predictable attribute is customer purchases, the decision trees algorithm tries to find one column of data, among all the inputs that you provided, that most effectively detects the customers that are likely to purchase a bike and those who are unlikely to buy a bike. For example, the model might find that Age is the best indicator of purchasing behavior. Specifically, that the customers over the age of 30 are very likely to purchase a bike, and all other customers are unlikely to make a purchase. In this scenario, the model creates a split on the Age attribute. That means that the tree divides into two branches, one containing customers over the age of 30, and the other containing customers under 30. The new branches are represented in the model structure as two new interior trees (NODE_TYPE = 3).
For each branch, the model continues to look for additional attributes to use in differentiating customers. If there is insufficient evidence in the data to continue creating subgroups of customers, the model stops building the tree. The model will also stop building the tree whenever the number of cases in the node is too small to continue, regardless of how good the split is, or if the value is null or missing. By stopping the growth of the tree early, you prevent the model from training too closely to one particular set of data.
Each interior tree node contains leaf nodes that provide a breakdown of the outcomes given the current classification results. For example, you might have an interior node that represents Age >= 30 and Gender = Male. The node for this group shows you how many customers in this category purchased or did not purchase something. For example, the classification might contain the following tree splits:
Age >= 30
Age >= 30 and Gender = Male
Age >= 30 and Gender = Female
Age < 30
Age < 30 and Gender = Male
Age < 30 and Gender = Female
When you use a decision tree model for prediction, the model takes the attributes that you provide to it as arguments and follows the path of the attributes down through the tree. In general, all predictions go to a leaf, and the interior nodes are used only for classification.
A leaf node always has a NODE_TYPE of 4 (Distribution) and contains a histogram that tells the probability of each outcome (purchase or not purchase) given the attributes you provide. For example, if you ask for a prediction for a new customer who is a male over 60, the model will look up the corresponding node (Age > 30 and Gender = Male) and then return the probability for the outcome that you specify. These probabilities are stored in the NODE_DISTRIBUTION table for the node.
If the predictable attribute is a continuous number, the algorithm tries to create a regression formula that models the relationship between the predictable attribute and the inputs.
Node Caption and Node Description
In a decision tree model, the node caption and node description contain similar information. However, the node description is more complete and contains more information as you move closer to the leaf nodes. Both the node caption and node description are localized strings.
Displays the attribute that distinguishes that particular node relative to the parent node. The node caption defines a sub-segment of the population based the split condition. For example, if the split was on [Age] and it was a three-way split, the node captions for the three child nodes might be "[Age] < 40", "40 <= [Age] < 50", "[Age] >= 50".
Contains a full list of the attributes that distinguish that node from other nodes, starting from the model parent node. For example, Product name = Apple and Color = Red.
Node Rule and Marginal Rule
The NODE_RULE and MARGINAL_RULE columns contain the same information as the NODE_CAPTION and NODE_DESCRIPTION columns, but represent the information as XML fragments. The node rule is an XML version of the full path, whereas the marginal rule indicates the most recent split.
The attribute represented by the XML fragment can be either simple or complex. A simple attribute contains the name of the model column, and the value of the attribute. If the model column contains a nested table, the nested table attribute is represented as a concatenation of the table name, the key value, and the attribute.
SQL ServerAnalysis Services supports version 2.0 of the PMML standard, with extensions to support the use of nested table. If your data contains nested tables and you generate a PMML version of the model, all elements in the model that include the predicates are marked as an extension.
Node Distribution for Discrete Attributes
In a decision trees model, the NODE_DISTRIBUTION table contains useful statistics. However, the type of statistics depends on whether the tree predicts a discrete or continuous attribute. This section describes the meaning of the node distribution statistics for discrete attributes.
Attribute Name and Attribute Value
In a classification tree, the attribute name always contains the name of the predictable column. This value tells you what the tree predicts. Because a single tree always represents a single predictable attribute, this value is repeated throughout the tree.
For a discrete data type, the attribute value field lists the possible values of the predictable column, plus the Missing value.
The support value for each node tells you how many cases are included in this node. At the (All) level, you should see the complete count of cases that were used to train the model. For each split in the tree, the support value is the count of cases that were grouped into that node of the tree. The sum of cases in the leaf nodes necessarily equals the count of cases in the parent node of the tree.
For nodes that represent continuous attributes, the presence of nulls in the data might lead to some counterintuitive results. For example, if there are m cases, a mean value would be calculated as sum(all cases)/n, where n is a number less than m, and m-n indicates the count of cases with missing values. Support is also represented as n.
The probability associated with each node tells you the probability that any case in the whole data set would end up in this particular node. Probability scores are computed both for the tree as a whole, and for the immediate split.
For example, the following table shows a very simple model, with 100 cases.
Probability relative to parent node
Probability relative to top node
Age >= 30
Age >= 30 and Gender = Male
50/60 = .83
50/100 = .5
Age >= 30 and Gender = Female
10/60 = .16
10/100 = .10
Age < 30
Age < 30 and Gender = Male
30/40 = .75
30/100 = .30
Age < 30 and Gender = Female
10/40 = .25
10/100 = .10
A small adjustment is made in all models to account for possible missing values. For continuous attributes, each value or range of values is represented as a state (for example, Age <30, Age = 30, and Age >30) and the probabilities are calculated as follows: state exists (value = 1), some other state exists (value = 0), state is Missing. For more information about how probabilities are adjusted to represent missing values, see Missing Values (Analysis Services - Data Mining).
The probabilities for each node are calculated almost directly from the distribution, as follows:
Probability = (support for state + support for prior state) / (node support plus the prior node support)
Analysis Services uses probabilities for each node to compare the stored probability with the prior probability to determine whether the path from the parent to the child node indicates a strong inference.
When making predictions, the probability of the distribution must be balanced with the probability of the node, to smoothen the probabilities. For example, if a split in the tree separates cases by a ratio of 9000/1000, the tree is very unbalanced. As a result, a prediction coming from the small branch should not carry the same weight as a prediction coming from a branch with many cases.
Variance is a measure of how scattered values in a sample are, given an expected distribution. For discrete values, the variance is 0 by definition.
For information about how variance is calculated for continuous values, see Mining Model Content for Linear Regression Models (Analysis Services - Data Mining).
The value type column provides information about the meaning of the numeric value provided in the other columns in the NODE_DISTRIBUTION table. You can use the value type in queries to retrieve specific rows from the nested tables. For examples, see Querying a Decision Trees Model (Analysis Services - Data Mining).
Of the types in the MiningValueType enumeration, the following are used in classification trees.
Indicates a count, probability, or other statistic related to missing values.
Indicates a count, probability, or other statistic related to a discrete or discretized value.
If the model includes a continuous predictable attribute, the tree might also contain value types that are unique to regression formulas. For a list of the value types that are used in regression trees, see Mining Model Content for Linear Regression Models (Analysis Services - Data Mining).
The node score represents slightly different information at each level of the tree. In general, the score is a numeric value that tells you how good a split was achieved by splitting on the condition. The value is represented as a double, where a higher value is better.
By definition, the model node and all leaf nodes have a node score of 0.
For the (All) node that represents the top of each tree, the MSOLAP_NODE_SCORE column contains the best split score in the whole tree.
For all other nodes in the tree (except leaf nodes), the score for each node represents the best split score for the current node, minus the split score for the parent node. Typically, the split score for a parent node should always be better than the split score on any one of its child nodes. That is because a decision trees model ideally splits on the most important attributes first.
There are many ways of calculating a score for a split, depending on the algorithm parameter you choose. A discussion of how the scores are calculated for each of the scoring methods is beyond the scope of this topic. For more information, see "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data", on the Microsoft Research Web site.
If you create a decision trees model that has both continuous and discrete predictable attributes, you will see completely different scores in the (All) nodes that represent each tree type. Each model should be considered independently, and the methods used for scoring regression are completely different from those used for scoring classification. The node score values cannot be compared.
If a decision trees model contains a predictable attribute with continuous numeric data, the Microsoft Decision Trees algorithm seeks to find areas in the data where the relationship between the predicted state and the input variables is linear. If the algorithm is successful in finding a linear relationship, it creates a special tree (NODE_TYPE = 25) that represents a linear regression. These regression tree nodes are more complex than nodes that represent discrete values.
In general, a regression maps the changes in the continuous dependent (predictable variable) as a function of changes in the inputs. If the dependent variable has any continuous inputs, and the relationship between the input and predicted value is stable enough to be computed as a line graph, the node for the regression contains a formula.
However, if the relationship between the input and predicted value is nonlinear, a split is created instead, just like a standard decision tree. For example, assume that A is the predictable attribute, and B and C are the inputs, where C is a continuous value type. If the relationship between A and C is fairly stable in parts of the data, but unstable in others, the algorithm will create splits to represent the different areas of the data.
Result in node
if n < 5
Relationship can be expressed as equation 1
if n between 5 and 10
if n > 10
Relationship can be expressed as equation 2
For more information about regression nodes, see Mining Model Content for Linear Regression Models (Analysis Services - Data Mining).