Mining Model Content for Linear Regression Models (Analysis Services - Data Mining)
This topic describes mining model content that is specific to models that use the Microsoft Linear Regression algorithm. For a general explanation of mining model content for all model types, see Mining Model Content (Analysis Services - Data Mining).
A linear regression model has an extremely simple structure. Each model has a single parent node that represents the model and its metadata, and a regression tree node (NODE_TYPE = 25) that that contains the regression formula for each predictable attribute.
Linear regression models use the same algorithm as Microsoft Decision Trees, but different parameters are used to constrain the tree, and only continuous attributes are accepted as inputs. However, because linear regression models are based on the Decision Trees algorithm, linear regression models are displayed by using the Microsoft Decision Tree Viewer. For information, see Viewing a Mining Model with the Microsoft Tree Viewer.
The next section explains how to interpret information in the regression formula node. This information applies not only to linear regression models, but also to decision trees models that contain regressions in a portion of the tree.
This section provides detail and examples only for those columns in the mining model content that have particular relevance for linear regression.
For information about general-purpose columns in the schema rowset, see Mining Model Content (Analysis Services - Data Mining).
When you create a model by using the Microsoft Linear Regression algorithm, the data mining engine creates a special instance of a decision trees model and supplies parameters that constrain the tree to contain all the training data in a single node. All continuous inputs are flagged and evaluated as potential regressors, but only those regressors that fit the data are retained as regressors in the final model. The analysis produces either a single regression formula for each regressor or no regression formula at all.
You can view the complete regression formula in the Mining Legend, by clicking the (All) node in the Microsoft Tree Viewer.
Also, when you create a decision trees model that includes a continuous predictable attribute, sometimes the tree has regression nodes that share the properties of regression tree nodes.
Most of the important information in a regression node is contained in the NODE_DISTRIBUTION table. The following example illustrates the layout of the NODE_DISTRIBUTION table. In this example, the Targeted Mailing mining structure has been used to create a linear regression model that predicts customer income based on age. The model is for the purpose of illustration only, because it can be built easily using the existing AdventureWorks sample data and mining structure.
The NODE_DISTRIBUTION table contains multiple rows, each grouped by a variable. The first two rows are always value types 1 and 3, and describe the target attribute. The succeeding rows provide details about the formula for a particular regressor. A regressor is an input variable that has a linear relationship with the output variable. You can have multiple regressors, and each regressor will have a separate row for the coefficient (VALUETYPE = 7), score gain (VALUETYPE = 8), and statistics (VALUETYPE = 9). Finally, the table has a row that contains the intercept of the equation (VALUETYPE = 11).
Elements of the Regression Formula
The nested NODE_DISTRIBUTION table contains each element of the regression formula in a separate row. The first two rows of data in the example results contain information about the predictable attribute, Yearly Income, which models the dependent variable. The SUPPORT column shows the count of cases in support of the two states of this attribute: either a Yearly Income value was available, or the Yearly Income value was missing.
The VARIANCE column tells you the computed variance of the predictable attribute. Variance is a measure of how scattered the values are in a sample, given an expected distribution. Variance here is calculated by taking the average of the squared deviation from the mean. The square root of the variance is also known as standard deviation. Analysis Services does not provide the standard deviation but you can easily calculate it.
For each regressor, three rows are output. They contain the coefficient, score gain, and regressor statistics.
Finally, the table contains a row that provides the intercept for the equation.
For each regressor, a coefficient (VALUETYPE = 7) is calculated. The coefficient itself appears in the ATTRIBUTE_VALUE column, whereas the VARIANCE column tells you the variance for the coefficient. The coefficients are calculated so as to maximize linearity.
The score gain (VALUETYPE = 8) for each regressor represents the interestingness score of the attribute. You can use this value to estimate the usefulness of multiple regressors.
The regressor statistic (VALUETYPE = 9) is the mean for the attribute for cases that have a value. The ATTRIBUTE_VALUE column contains the mean itself, whereas the VARIANCE column contains the sum of deviations from the mean.
Normally, the intercept (VALUETYPE = 11) or residual in a regression equation tells you the value of the predictable attribute, at the point where the input attribute, is 0. In many cases, this might not happen, and could lead to counterintuitive results.
For example, in a model that predicts income based on age, it is useless to learn the income at age 0. In real-life, it is typically more useful to know about the behavior of the line with respect to average values. Therefore, SQL Server Analysis Services modifies the intercept to express each regressor in a relationship with the mean.
This adjustment is difficult to see in the mining model content, but is apparent if you view the completed equation in the Mining Legend of the Microsoft Tree Viewer. The regression formula is shifted away from the 0 point to the point that represents the mean. This presents a view that is more intuitive given the current data.
Therefore, assuming that the mean age is around 45, the intercept (VALUETYPE = 11) for the regression formula tells you the mean income.