Microsoft Naive Bayes Algorithm

The Microsoft Naive Bayes algorithm is a classification algorithm provided by Microsoft SQL Server 2005 Analysis Services (SSAS) for use in predictive modeling. The algorithm calculates the conditional probability between input and predictable columns, and assumes that the input columns are independent of one another. This assumption of independence gives Naive Bayes its name: the assumption is often naive, because the algorithm does not take into account dependencies that may exist between columns.
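The independence assumption can be written out in the standard textbook form of Naive Bayes. The symbols are assumptions introduced here for illustration and do not come from the product documentation: y is a state of the predictable column and x_1 through x_n are the states of the n input columns for one case.

```latex
P(y \mid x_1, \dots, x_n)
  = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}
         {\sum_{y'} P(y') \prod_{i=1}^{n} P(x_i \mid y')}
```

Each factor P(x_i | y) depends on a single input column, which is why the model is cheap to train and why it cannot represent dependencies between input columns.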

This algorithm is less computationally intensive than other Microsoft algorithms, and is therefore useful for quickly generating mining models to discover relationships between input columns and predictable columns. You can use this algorithm for initial exploration of the data, and then apply what you learn to create additional mining models with other algorithms that are more computationally intensive and more accurate.

Example

As an ongoing promotional strategy, the marketing department for the Adventure Works Cycle company has decided to target potential customers by mailing out fliers. To reduce costs, they want to send fliers only to those customers who are likely to respond. The company stores information in a database about demographics and response to a previous mailing. They want to use this data to see how demographics such as age and location can help predict response to a promotion, by comparing potential customers to customers who have similar characteristics and who have purchased from the company in the past. Specifically, they want to see the differences between those customers who bought a bicycle and those customers who did not.

By using the Microsoft Naive Bayes algorithm, the marketing department can quickly predict an outcome for a particular customer profile, and can therefore determine which customers are most likely to respond to the fliers. By using the Microsoft Naive Bayes Viewer in Business Intelligence Development Studio, they can also visually investigate specifically which input columns contribute to positive responses to fliers.

How the Algorithm Works

The Microsoft Naive Bayes algorithm calculates the probability of every state of each input column, given each possible state of the predictable column. You can use the Microsoft Naive Bayes Viewer in Business Intelligence Development Studio to see a visual representation of how the algorithm distributes states, as shown in the following graphic.

The Microsoft Naive Bayes Viewer lists each input column in the dataset, and shows how the states of each column are distributed, given each state of the predictable column. You can use this view to identify the input columns that are important in differentiating between states of the predictable column. For example, in the Commute Distance column shown here, the probability that a customer will buy a bike is 0.387 if they commute from one to two miles to work, while the probability that they will not buy a bike is 0.287 for the same commute distance. In this example, the algorithm uses the numeric information, derived from customer characteristics such as commute distance, to predict whether a customer will buy a bike. For more information about using the Microsoft Naive Bayes Viewer, see Viewing a Mining Model with the Microsoft Naive Bayes Viewer.
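The calculation described in this section can be sketched in a few lines of Python. This is a minimal illustration of the general Naive Bayes technique, not the Analysis Services implementation: the rows, the column names, the helper functions fit, conditional, and predict, and the add-one smoothing are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (commute distance, region, bike buyer).
ROWS = [
    ("0-1 miles", "Pacific", "yes"),
    ("1-2 miles", "Pacific", "yes"),
    ("1-2 miles", "Europe", "yes"),
    ("10+ miles", "Europe", "no"),
    ("1-2 miles", "Europe", "no"),
    ("10+ miles", "Pacific", "no"),
]
INPUT_NAMES = ["commute", "region"]


def fit(rows, input_names):
    """Count predictable states and, per predictable state, each input state."""
    class_counts = Counter(row[-1] for row in rows)
    # state_counts[column][predictable_state][input_state] -> number of rows
    state_counts = {name: defaultdict(Counter) for name in input_names}
    for row in rows:
        label = row[-1]
        for name, value in zip(input_names, row):
            state_counts[name][label][value] += 1
    return class_counts, state_counts


def conditional(state_counts, class_counts, name, value, label):
    """P(input column = value | predictable column = label).

    Add-one smoothing is an assumption of this sketch; it keeps unseen
    combinations from producing a probability of zero.
    """
    states = set()
    for counter in state_counts[name].values():
        states.update(counter)
    seen = state_counts[name][label][value]
    return (seen + 1) / (class_counts[label] + len(states))


def predict(class_counts, state_counts, profile):
    """Return normalized P(label | profile) under the independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(label)
        for name, value in profile.items():
            score *= conditional(state_counts, class_counts, name, value, label)
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


class_counts, state_counts = fit(ROWS, INPUT_NAMES)
posterior = predict(
    class_counts, state_counts, {"commute": "1-2 miles", "region": "Pacific"}
)
print(posterior)
```

The table that conditional reads from plays the same role as the per-column distributions in the viewer: one distribution of input states for each state of the predictable column.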

Using the Algorithm

A Naive Bayes model must contain a key column, input columns, and one predictable column. All columns must be either discrete or discretized columns. For information about discretizing columns, see Discretization Methods.
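To show what discretizing a continuous column means in practice, the sketch below maps numeric ages to bucket indexes. Equal-width bucketing and the function name equal_width_bucket are assumptions chosen for simplicity; the methods that Analysis Services actually offers are described in Discretization Methods and may behave differently.

```python
def equal_width_bucket(value, low, high, buckets):
    """Map a numeric value in [low, high] to a bucket index 0..buckets-1."""
    if not low <= value <= high:
        raise ValueError("value outside the expected range")
    width = (high - low) / buckets
    # The top edge of the range belongs to the last bucket.
    return min(int((value - low) / width), buckets - 1)


# Hypothetical ages, split into four equal-width buckets between 18 and 80.
ages = [18, 25, 34, 47, 61, 80]
labels = [equal_width_bucket(age, low=18, high=80, buckets=4) for age in ages]
print(labels)
```

After this step each age is one of a small number of discrete states, which is the form of input the model requires.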

The Microsoft Naive Bayes algorithm supports specific input column content types, predictable column content types, and modeling flags, which are listed in the following table.

Input column content types: Cyclical, Discrete, Discretized, Key, Table, and Ordered

Predictable column content types: Cyclical, Discrete, Discretized, Table, and Ordered

Modeling flags: MODEL_EXISTENCE_ONLY and NOT NULL

All Microsoft algorithms support a common set of functions. However, the Microsoft Naive Bayes algorithm supports additional functions, listed in the following table.

For a list of the functions that are common to all Microsoft algorithms, see Data Mining Algorithms. For more information about how to use these functions, see Data Mining Extensions (DMX) Function Reference.

The Microsoft Naive Bayes algorithm does not support using the Predictive Model Markup Language (PMML) to create mining models.

The Microsoft Naive Bayes algorithm supports several parameters that affect the performance and accuracy of the resulting mining model. The following table describes each parameter.

Parameter Description

MAXIMUM_INPUT_ATTRIBUTES

Specifies the maximum number of input attributes that the algorithm can handle before it invokes feature selection. Setting this value to 0 disables feature selection for input attributes.

The default is 255.

MAXIMUM_OUTPUT_ATTRIBUTES

Specifies the maximum number of output attributes that the algorithm can handle before it invokes feature selection. Setting this value to 0 disables feature selection for output attributes.

The default is 255.

MINIMUM_DEPENDENCY_PROBABILITY

Specifies the minimum dependency probability between input and output attributes. This value is used to limit the size of the content that is generated by the algorithm. This property can be set from 0 to 1. Larger values reduce the number of attributes in the content of the model.

The default is 0.5.

MAXIMUM_STATES

Specifies the maximum number of attribute states that the algorithm supports. If the number of states that an attribute has is greater than the maximum number of states, the algorithm uses the attribute’s most popular states and treats the remaining states as missing.

The default is 100.
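The effect of MAXIMUM_STATES can be illustrated with a short Python sketch. It follows the description above (keep the most popular states, treat the rest as missing) but is not the Analysis Services implementation: the function cap_states, its maximum_states argument, the sample data, and the use of None to represent a missing state are assumptions, and the way ties between equally popular states are broken is not specified by the source.

```python
from collections import Counter


def cap_states(values, maximum_states):
    """Keep the most frequent states; map every other state to None (missing)."""
    keep = {state for state, _ in Counter(values).most_common(maximum_states)}
    return [value if value in keep else None for value in values]


# Hypothetical attribute with four states, capped at two.
cities = ["Seattle", "Paris", "Seattle", "Berlin", "Paris", "Seattle", "Oslo"]
capped = cap_states(cities, maximum_states=2)
print(capped)
```

In this example Seattle and Paris are kept, and the rows for Berlin and Oslo are treated as if the attribute had no value.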