Microsoft Naive Bayes Algorithm

The Microsoft Naive Bayes algorithm is a classification algorithm provided by Microsoft SQL Server Analysis Services for use in predictive modeling. The name Naive Bayes derives from the fact that the algorithm uses Bayes' theorem but does not take into account dependencies that may exist between the input attributes; its assumptions are therefore said to be naive.
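The naive assumption can be written out in the standard textbook form. The symbols here are notation introduced for illustration, not names used by Analysis Services: c is a state of the predictable column and x_1, ..., x_n are the states of the n input columns for one case.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

Each factor P(x_i | c) involves only one input column, which is why the model is cheap to train: it needs only per-column counts, never joint counts across several inputs.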

This algorithm is less computationally intensive than other Microsoft algorithms, and is therefore useful for quickly generating mining models to discover relationships between input columns and predictable columns. You can use this algorithm for initial exploration of the data, and then apply the results to create additional mining models with other algorithms that are more computationally intensive and more accurate.

Example

As an ongoing promotional strategy, the marketing department for the Adventure Works Cycle company has decided to target potential customers by mailing out fliers. To reduce costs, they want to send fliers only to those customers who are likely to respond. The company stores information in a database about demographics and response to a previous mailing. They want to use this data to see how demographics such as age and location can help predict response to a promotion, by comparing potential customers to customers who have similar characteristics and who have purchased from the company in the past. Specifically, they want to see the differences between those customers who bought a bicycle and those customers who did not.

By using the Microsoft Naive Bayes algorithm, the marketing department can quickly predict an outcome for a particular customer profile, and can therefore determine which customers are most likely to respond to the fliers. By using the Microsoft Naive Bayes Viewer in Business Intelligence Development Studio, they can also visually investigate specifically which input columns contribute to positive responses to fliers.

How the Algorithm Works

The Microsoft Naive Bayes algorithm calculates the probability of every state of each input column, given each possible state of the predictable column. You can use the Microsoft Naive Bayes Viewer in Business Intelligence Development Studio to see a visual representation of how the algorithm distributes states, as shown in the following graphic.

Figure: Naive Bayes distribution of states

The Microsoft Naive Bayes Viewer lists each input column in the dataset, and shows how the states of each column are distributed, given each state of the predictable column. You can use this view to identify the input columns that are important for differentiating between states of the predictable column. For example, in the Commute Distance column shown here, a commute of one to two miles has a probability of 0.387 among customers who bought a bike and 0.287 among customers who did not. Because each value is the probability of the input state given one outcome, the two values are not expected to sum to 1. In this example, the algorithm uses these probabilities, derived from customer characteristics such as commute distance, to predict whether a customer will buy a bike. For more information about using the Microsoft Naive Bayes Viewer, see Viewing a Mining Model with the Microsoft Naive Bayes Viewer.
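The per-column distributions described above can be approximated by simple counting. The following is a minimal Python sketch under stated assumptions, not the Analysis Services implementation: the rows, the column names, and the function conditional_distributions are all hypothetical and exist only to show the shape of the calculation.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (commute distance, region, bought a bike).
# These values are invented for illustration only.
ROWS = [
    ("0-1 Miles", "Pacific", "Yes"),
    ("1-2 Miles", "Pacific", "Yes"),
    ("1-2 Miles", "Europe", "Yes"),
    ("5-10 Miles", "Europe", "Yes"),
    ("0-1 Miles", "Europe", "No"),
    ("1-2 Miles", "North America", "No"),
    ("10+ Miles", "North America", "No"),
    ("10+ Miles", "Pacific", "No"),
]
INPUT_NAMES = ("Commute Distance", "Region")


def conditional_distributions(rows, input_names):
    """Return {input name: {class state: {input state: probability}}}."""
    # How many rows fall into each state of the predictable column.
    class_counts = Counter(row[-1] for row in rows)
    # Per input column: counts of each input state within each class state.
    counts = {name: defaultdict(Counter) for name in input_names}
    for row in rows:
        label = row[-1]
        for name, state in zip(input_names, row[:-1]):
            counts[name][label][state] += 1
    # Divide by the class count to get P(input state | class state).
    return {
        name: {
            label: {state: n / class_counts[label] for state, n in states.items()}
            for label, states in by_label.items()
        }
        for name, by_label in counts.items()
    }


tables = conditional_distributions(ROWS, INPUT_NAMES)
for label, distribution in tables["Commute Distance"].items():
    print(label, distribution)
```

Each inner dictionary corresponds to one bar in the viewer: the distribution of one input column within one state of the predictable column. A production implementation would typically also smooth zero counts, which this sketch omits.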

Data Required for Naive Bayes Models

When you prepare data for use in training a Naive Bayes model, you should understand the requirements for the algorithm, including how much data is needed, and how the data is used.

The requirements for a Naive Bayes model are as follows:

  • A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not allowed.

  • Input columns: In a Naive Bayes model, all input columns must be discrete or discretized. For information about discretizing columns, see Discretization Methods (Data Mining). Because the algorithm assumes that input attributes are independent of each other, it is important to check that this assumption is reasonable for your data.

  • At least one predictable column: The predictable attribute must contain discrete or discretized values. The values of the predictable column can also be treated as input, and frequently are, to find relationships among the columns.
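To illustrate what discretizing a column means, the following Python sketch places continuous values into equal-width buckets. This is an assumption made for simplicity: the function equal_width_buckets and the sample ages are hypothetical, and the method shown is not necessarily one that Analysis Services applies. See Discretization Methods (Data Mining) for the methods the product supports.

```python
def equal_width_buckets(values, bucket_count):
    """Map each numeric value to a bucket label such as '23-34'."""
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (high - low) / bucket_count or 1.0
    labels = []
    for value in values:
        # The maximum value would land in an extra bucket, so clamp it.
        index = min(int((value - low) / width), bucket_count - 1)
        start = low + index * width
        labels.append(f"{start:g}-{start + width:g}")
    return labels


ages = [23, 31, 38, 45, 52, 67]  # hypothetical customer ages
print(equal_width_buckets(ages, 4))
```

After this step every value is one of a small number of labels, which is the form of input that a Naive Bayes model requires.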

Viewing the Model

To explore the model, you can use the Microsoft Naive Bayes Viewer. The viewer shows you how the input attributes relate to the predictable attribute. The viewer also provides a detailed profile of each state of the predictable attribute, a list of the attributes that distinguish each state from the others, and the characteristics of the entire training data set. For more information, see Viewing a Mining Model with the Microsoft Naive Bayes Viewer.

For more detail, you can browse the model in the Microsoft Generic Content Tree Viewer (Data Mining Designer). For more information about the type of information stored in the model, see Mining Model Content for Naive Bayes Models (Analysis Services - Data Mining).

Making Predictions

After the model has been trained, the results are stored as a set of patterns, which you can explore or use to make predictions.

You can create queries to return predictions about how new data relates to the predictable attribute, or you can retrieve statistics that describe the correlations found by the model.

For information about how to create queries against a data mining model, see Querying Data Mining Models (Analysis Services - Data Mining). For examples of how to use queries with a Naive Bayes model, see Querying a Naive Bayes Model (Analysis Services - Data Mining).
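Conceptually, a prediction combines the stored conditional probabilities for a new case and normalizes the result. The Python sketch below illustrates that arithmetic only; it is not a query against Analysis Services. It reads the 0.387 and 0.287 figures from the Commute Distance example as the probability of that commute state given each outcome. The priors, the Region values, and the function predict are hypothetical.

```python
# P(state of the predictable column). Hypothetical: equal priors.
PRIORS = {"Yes": 0.5, "No": 0.5}

# P(input state | Bike Buyer state).
# Commute Distance values come from the example in this topic;
# Region values are hypothetical.
LIKELIHOODS = {
    "Commute Distance": {"1-2 Miles": {"Yes": 0.387, "No": 0.287}},
    "Region": {"Pacific": {"Yes": 0.30, "No": 0.20}},
}


def predict(case, priors, likelihoods):
    """Return normalized P(class state | case) under the naive assumption."""
    scores = dict(priors)
    for column, state in case.items():
        for label in scores:
            # Multiply in one factor per input column, treating inputs
            # as independent given the class state.
            scores[label] *= likelihoods[column][state][label]
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


new_customer = {"Commute Distance": "1-2 Miles", "Region": "Pacific"}
posterior = predict(new_customer, PRIORS, LIKELIHOODS)
print(posterior)
```

The state with the largest normalized score is the predicted outcome, and the score itself serves as the probability reported for that prediction.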

Remarks

  • Supports the use of Predictive Model Markup Language (PMML) to create mining models.

  • Supports drillthrough.

  • Does not support the creation of data mining dimensions.

  • Supports the use of OLAP mining models.