The Microsoft Decision Trees algorithm learns Bayesian networks from a combination of prior knowledge and statistical data. An important part of the algorithm is the methodology for assessing the information value of the priors needed for learning. The approach is based on the assumption of likelihood equivalence, which says that data should not help to discriminate network structures that otherwise represent the same assertions of conditional independence.
Each case is assumed to have a single Bayesian prior network and a single measure of confidence for that network. Using these prior networks, the algorithm then computes the relative posterior probabilities of network structures given the current training data, and identifies the network structures that have the highest posterior probabilities.
The Microsoft Decision Trees algorithm uses different methods to compute the best tree. The method used depends on the task, which can be linear regression, classification, or association analysis. A single model can contain multiple trees for different predictable attributes. Moreover, each tree can contain multiple branches, depending on how many attributes and values there are in the data. The shape and depth of the tree built in a particular model depends on the scoring method and other parameters that were used. Changes in the parameters can also affect where the nodes split.
Building the Tree
When the Microsoft Decision Trees algorithm creates the set of possible input values, it performs feature selection to identify the attributes and values that provide the most information, and removes from consideration the values that are very rare. The algorithm also groups values into bins, to create groupings of values that can be processed as a unit to optimize performance.
A tree is built by determining the correlations between an input and the targeted outcome. After all the attributes have been correlated, the algorithm identifies the single attribute that most cleanly separates the outcomes. This point of the best separation is measured by using an equation that calculates information gain. The attribute that has the best score for information gain is used to divide the cases into subsets, which are then recursively analyzed by the same process, until the tree cannot be split any more.
The exact equation used to evaluate information gain depends on the parameters set when you created the algorithm, the data type of the predictable column, and the data type of the input.
Discrete and Continuous Inputs
When the predictable attribute is discrete and the inputs are discrete, counting the outcomes per input is a matter of creating a matrix and generating scores for each cell in the matrix.
However, when the predictable attribute is discrete and the inputs are continuous, the input of the continuous columns are automatically discretized. You can accept the default and have Analysis Services find the optimum number of bins, or you can control the manner in which continuous inputs are discretized by setting the DiscretizationMethod and DiscretizationBucketCount properties. For more information, see How to: Change the Discretization of a Column in a Mining Model.
For continuous attributes, the algorithm uses linear regression to determine where a decision tree splits.
When the predictable attribute is a continuous numeric data type, feature selection is applied to the outputs as well, to reduce the possible number of outcomes and build the model faster. You can change the threshold for feature selection and thereby increase or decrease the number of possible values by setting the MAXIMUM_OUTPUT_ATTRIBUTES parameter.
For a more detained explanation about how the Microsoft Decision Trees algorithm works with discrete predictable columns, see Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. For more information about how the Microsoft Decision Trees algorithm works with a continuous predictable column, see the appendix of Autoregressive Tree Models for Time-Series Analysis.
Scoring Methods and Feature Selection
The Microsoft Decision Trees algorithm offers three formulas for scoring information gain: Shannon's entropy, Bayesian network with K2 prior, and Bayesian network with a uniform Dirichlet distribution of priors. All three methods are well established in the data mining field. We recommend that you experiment with different parameters and scoring methods to determine which provides the best results. For more information about these scoring methods, see Feature Selection.
All Analysis Services data mining algorithms automatically use feature selection to improve analysis and reduce processing load. The method used for feature selection depends on the algorithm that is used to build the model. The algorithm parameters that control feature selection for a decision trees model are MAXIMUM_INPUT_ATTRIBUTES and MAXIMUM_OUTPUT.
|
Algorithm
|
Method of analysis
|
Comments
|
|---|
|
Decision Trees
|
Interestingness score
Shannon's Entropy
Bayesian with K2 Prior
Bayesian Dirichlet with uniform prior (default)
|
If any columns contain non-binary continuous values, the interestingness score is used for all columns, to ensure consistency. Otherwise, the default or specified method is used.
|
|
Linear Regression
|
Interestingness score
|
Linear Regression only uses interestingness, because it only supports continuous columns.
|
Scalability and Performance
Classification is an important data mining strategy. Generally, the amount of information that is needed to classify the cases grows in direct proportion to the number of input records. This limits the size of the data that can be classified. The Microsoft Decision Trees algorithm using uses the following methods to resolve these problems, improve performance, and eliminate memory restrictions:
-
Feature selection to optimize the selection of attributes.
-
Bayesian scoring to control tree growth.
-
Optimization of binning for continuous attributes.
-
Dynamic grouping of input values to determine the most important values.
The Microsoft Decision Trees algorithm is fast and scalable, and has been designed to be easily parallelized, meaning that all processors work together to build a single, consistent model. The combination of these characteristics makes the decision-tree classifier an ideal tool for data mining.
If performance constraints are severe, you might be able to improve processing time during the training of a decision tree model by using the following methods. However, if you do so, be aware that eliminating attributes to improve processing performance will change the results of the model, and possibly make it less representative of the total population.
-
Increase the value of the COMPLEXITY_PENALTY parameter to limit tree growth.
-
Limit the number of items in association models to limit the number of trees that are built.
-
Increase the value of the MINIMUM_SUPPORT parameter to avoid overfitting.
-
Restrict the number of discrete values for any attribute to 10 or less. You might try grouping values in different ways in different models.