Article
05/23/2011

Microsoft Linear Regression Algorithm Technical Reference

The Microsoft Linear Regression algorithm is a special version of the Microsoft Decision Trees algorithm that is optimized for modeling pairs of continuous attributes. This topic explains the implementation of the algorithm, describes how to customize the behavior of the algorithm, and provides links to additional information about querying models.

Implementation of the Linear Regression Algorithm

The Microsoft Decision Trees algorithm can be used for many tasks: linear regression, classification, or association analysis. To implement this algorithm for the purpose of linear regression, the parameters of the algorithm are controlled to restrict the growth of the tree and keep all data in the model in a single node. In other words, although linear regression is based on a decision tree, the tree contains only a single root and no branches: all data resides in the root node.

To accomplish this, the algorithm's MINIMUM_LEAF_CASES parameter is set to be greater than or equal to the total number of cases that the algorithm uses to train the mining model. With the parameter set in this way, the algorithm will never create a split, and therefore performs a linear regression.

The equation that represents the regression line takes the general form of y = ax + b, and is known as the regression equation. The variable Y represents the output variable, X represents the input variable, and a and b are adjustable coefficients. You can retrieve the coefficients, intercepts, and other information about the regression formula by querying the completed mining model. For more information, see Querying a Linear Regression Model (Analysis Services - Data Mining).

Scoring Methods and Feature Selection

All Analysis Services data mining algorithms automatically use feature selection to improve analysis and reduce processing load. The method used for feature selection in linear regression is the interestingness score, because the model supports only supports continuous columns. For reference, the following table shows the difference in feature selection for the Linear Regression algorithm and the Decision Trees algorithm.

Algorithm	Method of analysis	Comments
Linear Regression	Interestingness score	Default. Other feature selection methods that are available with the Decision Trees algorithm apply to discrete variables only and therefore are not applicable to linear regression models.
Decision Trees	Interestingness score Shannon's Entropy Bayesian with K2 Prior Bayesian Dirichlet with uniform prior (default)	If any columns contain non-binary continuous values, the interestingness score is used for all columns, to ensure consistency. Otherwise, the default or specified method is used.

Linear Regression

Interestingness score

Default.

Other feature selection methods that are available with the Decision Trees algorithm apply to discrete variables only and therefore are not applicable to linear regression models.

Decision Trees

Interestingness score

Shannon's Entropy

Bayesian with K2 Prior

Bayesian Dirichlet with uniform prior (default)

If any columns contain non-binary continuous values, the interestingness score is used for all columns, to ensure consistency. Otherwise, the default or specified method is used.

The algorithm parameters that control feature selection for a decision trees model are MAXIMUM_INPUT_ATTRIBUTES and MAXIMUM_OUTPUT.

Customizing the Linear Regression Algorithm

The Microsoft Linear Regression algorithm supports parameters that affect the behavior, performance, and accuracy of the resulting mining model. You can also set modeling flags on the mining model columns or mining structure columns to control the way that data is processed.

Setting Algorithm Parameters

The following table lists the parameters that are provided for the Microsoft Linear Regression algorithm.

Parameter	Description
MAXIMUM_INPUT_ATTRIBUTES	Defines the number of input attributes that the algorithm can handle before it invokes feature selection. Set this value to 0 to turn off feature selection. The default is 255.
MAXIMUM_OUTPUT_ATTRIBUTES	Defines the number of output attributes that the algorithm can handle before it invokes feature selection. Set this value to 0 to turn off feature selection. The default is 255.
FORCED_REGRESSOR	Forces the algorithm to use the indicated columns as regressors, regardless of the importance of the columns as calculated by the algorithm.

MAXIMUM_INPUT_ATTRIBUTES

Defines the number of input attributes that the algorithm can handle before it invokes feature selection. Set this value to 0 to turn off feature selection.

The default is 255.

MAXIMUM_OUTPUT_ATTRIBUTES

Defines the number of output attributes that the algorithm can handle before it invokes feature selection. Set this value to 0 to turn off feature selection.

The default is 255.

FORCED_REGRESSOR

Forces the algorithm to use the indicated columns as regressors, regardless of the importance of the columns as calculated by the algorithm.

Modeling Flags

The Microsoft Linear Regression algorithm supports the following modeling flags. When you create the mining structure or mining model, you define modeling flags to specify how values in each column are handled during analysis. For more information, see Modeling Flags (Data Mining).

Modeling Flag	Description
NOT NULL	Indicates that the column cannot contain a null. An error will result if Analysis Services encounters a null during model training. Applies to mining structure columns.
REGRESSOR	Indicates that the column contains continuous numeric values that should be treated as potential independent variables during analysis. Note Flagging a column as a regressor does not ensure that the column will be used as a regressor in the final model. Applies to mining model columns.

NOT NULL

Indicates that the column cannot contain a null. An error will result if Analysis Services encounters a null during model training.

Applies to mining structure columns.

REGRESSOR

Indicates that the column contains continuous numeric values that should be treated as potential independent variables during analysis.

Note

Flagging a column as a regressor does not ensure that the column will be used as a regressor in the final model.

Applies to mining model columns.

Regressors in Linear Regression Models

Linear regression models are based on the Microsoft Decision Trees algorithm. However, even if you do not use the Microsoft Linear Regression algorithm, any decision tree model can contain a tree or nodes that represent a regression on a continuous attribute.

You do not need to specify that a continuous column represents a regressor. The Microsoft Decision Trees algorithm will partition the dataset into regions with meaningful patterns even if you do not set the REGRESSOR flag on the column. The difference is that when you set the modeling flag, the algorithm will try to find regression equations of the form a*C1 + b*C2 + ... to fit the patterns in the nodes of the tree. The sum of the residuals is calculated, and if the deviation is too great, a split is forced in the tree.

For example, if you are predicting customer purchasing behavior using Income as an attribute, and set the REGRESSOR modeling flag on the column, the algorithm would first try to fit the Income values by using a standard regression formula. If the deviation is too great, the regression formula is abandoned and the tree would be split on some other attribute. The decision tree algorithm would then try to fit a regressor for income in each of the branches after the split.

You can use the FORCED_REGRESSOR parameter to guarantee that the algorithm will use a particular regressor. This parameter can be used with the Microsoft Decision Trees and Microsoft Linear Regression algorithms.

Requirements

A linear regression model must contain a key column, input columns, and at least one predictable column.

Input and Predictable Columns

The Microsoft Linear Regression algorithm supports the specific input columns and predictable columns that are listed in the following table. For more information about what the content types mean when used in a mining model, see Content Types (Data Mining).

Column	Content types
Input attribute	Continuous, Cyclical, Key, Table, and Ordered
Predictable attribute	Continuous, Cyclical, and Ordered

Note

Cyclical and Ordered content types are supported, but the algorithm treats them as discrete values and does not perform special processing.

Microsoft Linear Regression Algorithm Technical Reference

Implementation of the Linear Regression Algorithm

Scoring Methods and Feature Selection

Customizing the Linear Regression Algorithm

Setting Algorithm Parameters

Modeling Flags

Regressors in Linear Regression Models

Requirements

Input and Predictable Columns

See Also

Reference

Concepts

Additional resources