# Feature Selection in Data Mining

Feature selection is an essential part of any data mining workflow, because the dataset you use to build a model frequently contains more information than the model needs. For example, a dataset may contain 500 columns that describe characteristics of customers, but perhaps only 50 of those columns are used to build a particular model. If you keep the unneeded columns while building the model, more CPU and memory are required during the training process, and more storage space is required for the completed model.

Even if resources are not an issue, you typically want to remove unneeded columns because they can degrade the quality of discovered patterns, for the following reasons:

- Some columns are noisy or redundant. This noise makes it more difficult to discover meaningful patterns in the data.
- To discover quality patterns on a high-dimensional data set, most data mining algorithms require a much larger training data set; however, in some data mining applications only a small amount of training data is available.

Feature selection helps solve both problems: having too much data that is of little value, or having too little data that is of high value.

In general, feature selection works by calculating a score for each attribute, and then selecting only the attributes that have the best scores. You can adjust the threshold for the top scores. Feature selection is always performed before the model is trained, to automatically choose the attributes in a dataset that are most likely to be used in the model.
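The score-and-select process described above can be sketched in a few lines of Python. Everything here (the `select_features` helper, the toy scoring function, and the sample data) is illustrative only and is not part of any SSAS API:

```python
def select_features(data, score_fn, max_attributes):
    """Return the names of the highest-scoring attributes.

    data           -- dict mapping attribute name -> list of values
    score_fn       -- callable that scores one list of values
    max_attributes -- how many top-scoring attributes to keep
    """
    scores = {name: score_fn(values) for name, values in data.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max_attributes]

# Toy example: score each attribute by its number of distinct values,
# a stand-in for a real scoring method such as Shannon's entropy.
data = {
    "age":    [23, 45, 31, 52, 23],
    "flag":   [0, 1, 0, 0, 0],
    "region": [1, 2, 3, 1, 2],
}
top = select_features(data, lambda v: len(set(v)), max_attributes=2)
```

Only the attributes in `top` would then be passed on to model training; adjusting `max_attributes` plays the role of adjusting the threshold for the top scores.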

SQL Server 2008 Analysis Services (SSAS) provides multiple methods for feature selection. The exact method for selecting the attributes with the highest value depends on the algorithm used in your model, and any parameters that you may have set on your model. Feature selection is applied to inputs, predictable attributes, or to states in a column. Only the attributes and states that the algorithm selects are included in the model-building process and can be used for prediction. Predictable columns that are ignored by feature selection are used for prediction, but the predictions are based only on the global statistics that exist in the model.

Note Feature selection affects only the columns that are used in the model, and has no effect on storage of the mining structure. The columns that you leave out of the mining model are still available in the structure, and data in the mining structure columns will be cached.

### Definition of Feature Selection Methods

There are many ways to implement feature selection, depending on the type of data that you are working with and the algorithm that you choose for analysis. SQL Server Analysis Services provides several popular and well-established methods for scoring attributes. The method that is applied in any algorithm or data set depends on the data types, and the column usage.

The interestingness score is used to rank and sort attributes in columns that contain nonbinary continuous numeric data.

For columns that contain discrete and discretized data, you can choose from Shannon's entropy and two Bayesian scores; however, if the model contains any continuous columns, the interestingness score will be used to assess all input columns, to ensure consistency.

This section describes each method of feature selection.

#### Interestingness score

A feature is interesting if it tells you some useful piece of information. Because the definition of what is useful varies depending on the scenario, the data mining industry has developed various ways to measure interestingness. For example, novelty might be interesting in outlier detection, but the ability to discriminate between closely related items, or discriminating weight, might be more interesting for classification.

The measure of interestingness that is used in SQL Server Analysis Services is entropy-based, meaning that attributes with random distributions have higher entropy and lower information gain; therefore, such attributes are less interesting. The entropy for any particular attribute is compared to the entropy of all other attributes, as follows:

Interestingness(Attribute) = - (m - Entropy(Attribute)) * (m - Entropy(Attribute))

Central entropy, or m, means the entropy of the entire feature set. By subtracting the entropy of the target attribute from the central entropy, you can assess how much information the attribute provides.

This score is used by default whenever the column contains nonbinary continuous numeric data.
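Taken literally, the formula above assigns more negative scores the farther an attribute's entropy is from the central entropy m. The following Python sketch is one simplified reading of that formula for discretized values; it is an illustration, not the SSAS implementation:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discretized values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def interestingness(attribute_values, central_entropy):
    """-(m - Entropy(Attribute)) * (m - Entropy(Attribute)),
    where central_entropy is m, the entropy of the entire feature set."""
    return -(central_entropy - entropy(attribute_values)) ** 2
```

Continuous values would need to be discretized into buckets before `entropy` is meaningful; that step is omitted here for brevity.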

#### Shannon's Entropy

Shannon's entropy measures the uncertainty of a random variable for a particular outcome. For example, the entropy of a coin toss can be represented as a function of the probability of it coming up heads.

Analysis Services uses the following formula to calculate Shannon's entropy:

H(X) = -∑ P(xi) log(P(xi))

This scoring method is available for discrete and discretized attributes.
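The coin-toss example can be made concrete with a small Python function; this is a sketch that uses base-2 logarithms so the result is measured in bits:

```python
import math

def shannon_entropy(probabilities):
    """H(X) = -sum P(x_i) * log2 P(x_i), skipping zero-probability states."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin has maximum uncertainty: one full bit of entropy.
fair   = shannon_entropy([0.5, 0.5])   # 1.0
# A heavily biased coin is more predictable, so its entropy is lower.
biased = shannon_entropy([0.9, 0.1])   # ≈ 0.469
```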

#### Bayesian with K2 Prior

Analysis Services provides two feature selection scores that are based on Bayesian networks. A Bayesian network is a directed acyclic graph of states and transitions between states, meaning that some states are always prior to the current state, some states are posterior, and the graph does not repeat or loop. By definition, Bayesian networks allow the use of prior knowledge. However, the question of which prior states to use in calculating probabilities of later states is important for algorithm design, performance, and accuracy.

The K2 algorithm for learning a Bayesian network was developed by Cooper and Herskovits and is often used in data mining. It is scalable and can analyze multiple variables, but requires an ordering on the variables used as input. For more information, see Learning Bayesian Networks by Chickering, Geiger, and Heckerman.

This scoring method is available for discrete and discretized attributes.
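A minimal sketch of the Cooper-Herskovits metric with a K2 (all-ones Dirichlet) prior, for one child attribute and a single parent, is shown below. This is a teaching illustration of the general metric, not the Analysis Services implementation:

```python
import math
from collections import Counter

def k2_log_score(child, parent):
    """Log K2 score of a `child` column given a single `parent` column.

    For each parent state j the metric contributes
        log (r-1)! - log (N_j + r - 1)! + sum_k log N_jk!
    where r is the number of child states, N_j the count of rows in
    parent state j, and N_jk the joint counts. math.lgamma(n) computes
    log((n-1)!), which keeps the factorials numerically stable.
    """
    r = len(set(child))
    joint = Counter(zip(parent, child))
    parent_counts = Counter(parent)
    score = 0.0
    for j, n_j in parent_counts.items():
        score += math.lgamma(r) - math.lgamma(n_j + r)
        for (pj, _), n_jk in joint.items():
            if pj == j:
                score += math.lgamma(n_jk + 1)
    return score
```

A higher (less negative) score indicates that the parent explains the child's distribution better under the K2 prior.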

#### Bayesian Dirichlet Equivalent with Uniform Prior

The Bayesian Dirichlet Equivalent (BDE) score also uses Bayesian analysis to evaluate a network given a dataset. The BDE scoring method was developed by Heckerman and is based on the BD metric developed by Cooper and Herskovits. The Dirichlet distribution is a multinomial distribution that describes the conditional probability of each variable in the network, and has many properties that are useful for learning.

The Bayesian Dirichlet Equivalent with Uniform Prior (BDEU) method assumes a special case of the Dirichlet distribution, in which a mathematical constant is used to create a fixed or uniform distribution of prior states. The BDE score also assumes likelihood equivalence, which means that the data cannot be expected to discriminate equivalent structures. In other words, if the score for If A Then B is the same as the score for If B Then A, the structures cannot be distinguished based on the data, and causation cannot be inferred.

For more information about Bayesian networks and the implementation of these scoring methods, see Learning Bayesian Networks.
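The uniform-prior idea can be illustrated with a small Python sketch of a BDEU-style score for one child attribute and one parent, where `ess` (the equivalent sample size) plays the role of the mathematical constant that fixes the uniform prior. This is an illustration of the general metric, not the SSAS implementation:

```python
import math
from collections import Counter

def bdeu_log_score(child, parent, ess=1.0):
    """Log BDEU score of a `child` column given a single `parent` column.

    Uses the uniform Dirichlet prior alpha_jk = ess / (q * r), where q
    and r are the numbers of parent and child states and ess is the
    equivalent sample size.
    """
    parent_states = sorted(set(parent))
    child_states = sorted(set(child))
    q, r = len(parent_states), len(child_states)
    a_j, a_jk = ess / q, ess / (q * r)
    joint = Counter(zip(parent, child))
    parent_counts = Counter(parent)
    score = 0.0
    for j in parent_states:
        # Contribution of parent state j, marginal over child states...
        score += math.lgamma(a_j) - math.lgamma(a_j + parent_counts[j])
        # ...plus one term per (parent, child) state combination.
        for k in child_states:
            n_jk = joint.get((j, k), 0)
            score += math.lgamma(a_jk + n_jk) - math.lgamma(a_jk)
    return score
```

Because the prior is uniform, swapping which variable is treated as parent and which as child tends to produce comparable scores, reflecting the likelihood-equivalence property described above.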

### Feature Selection Methods used by Analysis Services Algorithms

The following table lists the algorithms that support feature selection, the feature selection methods used by the algorithm, and the parameters that you set to control feature selection behavior:

| Algorithm | Method of analysis | Comments |
|---|---|---|
| Naive Bayes | Shannon's entropy<br>Bayesian with K2 prior<br>Bayesian Dirichlet with uniform prior (default) | The Microsoft Naive Bayes algorithm accepts only discrete or discretized attributes; therefore, it cannot use the interestingness score. For more information about this algorithm, see Microsoft Naive Bayes Algorithm Technical Reference. |
| Decision trees | Interestingness score<br>Shannon's entropy<br>Bayesian with K2 prior<br>Bayesian Dirichlet with uniform prior (default) | If any columns contain non-binary continuous values, the interestingness score is used for all columns, to ensure consistency. Otherwise, the default feature selection method is used, or the method that you specified when you created the model. For more information about this algorithm, see Microsoft Decision Trees Algorithm Technical Reference. |
| Neural network | Interestingness score<br>Shannon's entropy<br>Bayesian with K2 prior<br>Bayesian Dirichlet with uniform prior (default) | The Microsoft Neural Network algorithm can use both entropy-based and Bayesian scoring methods, as long as the data contains continuous columns. For more information about this algorithm, see Microsoft Neural Network Algorithm Technical Reference. |
| Logistic regression | Interestingness score<br>Shannon's entropy<br>Bayesian with K2 prior<br>Bayesian Dirichlet with uniform prior (default) | Although the Microsoft Logistic Regression algorithm is based on the Microsoft Neural Network algorithm, you cannot customize logistic regression models to control feature selection behavior; therefore, feature selection always defaults to the method that is most appropriate for the attribute. If all attributes are discrete or discretized, the default is BDEU. For more information about this algorithm, see Microsoft Logistic Regression Algorithm Technical Reference. |
| Clustering | Interestingness score | The Microsoft Clustering algorithm can use discrete or discretized data. However, because the score of each attribute is calculated as a distance and is represented as a continuous number, the interestingness score must be used. For more information about this algorithm, see Microsoft Clustering Algorithm Technical Reference. |
| Linear regression | Interestingness score | The Microsoft Linear Regression algorithm can use only the interestingness score, because it supports only continuous columns. For more information about this algorithm, see Microsoft Linear Regression Algorithm Technical Reference. |
| Association rules<br>Sequence clustering | Not used | Feature selection is not invoked with these algorithms. However, you can control the behavior of the algorithm and reduce the size of input data if necessary by setting the value of the parameters MINIMUM_SUPPORT and MINIMUM_PROBABILITY. For more information, see Microsoft Association Algorithm Technical Reference and Microsoft Sequence Clustering Algorithm Technical Reference. |
| Time series | Not used | Feature selection does not apply to time series models. For more information about this algorithm, see Microsoft Time Series Algorithm Technical Reference. |

In algorithms that support feature selection, you can control when feature selection is turned on by using the following parameters. Each algorithm has a default value for the number of inputs that are allowed, and you can override this default and specify the number of attributes.

#### MAXIMUM_INPUT_ATTRIBUTES

If a model contains more columns than the number that is specified in the MAXIMUM_INPUT_ATTRIBUTES parameter, the algorithm ignores any columns that it calculates to be uninteresting.

#### MAXIMUM_OUTPUT_ATTRIBUTES

Similarly, if a model contains more predictable columns than the number that is specified in the MAXIMUM_OUTPUT_ATTRIBUTES parameter, the algorithm ignores any columns that it calculates to be uninteresting.

#### MAXIMUM_STATES

If a column contains more states than the number specified in the MAXIMUM_STATES parameter, the least popular states are grouped together and treated as missing. If any one of these parameters is set to 0, feature selection is turned off, affecting processing time and performance.
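The state-grouping behavior described for MAXIMUM_STATES can be sketched as follows. The `cap_states` function and its `missing` label are hypothetical names for illustration, not SSAS identifiers:

```python
from collections import Counter

def cap_states(values, maximum_states, missing="Missing"):
    """Keep only the `maximum_states` most popular states of a column,
    folding every less popular state into a single 'missing' bucket.
    A value of 0 means no cap, mirroring how setting the parameter to 0
    turns the behavior off."""
    if maximum_states == 0:
        return list(values)
    keep = {s for s, _ in Counter(values).most_common(maximum_states)}
    return [v if v in keep else missing for v in values]

cities = ["NY", "NY", "NY", "LA", "LA", "Oslo"]
capped = cap_states(cities, maximum_states=2)
# "Oslo", the least popular state, is folded into the missing bucket.
```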