Missing Values (Analysis Services - Data Mining)
A missing value can signify a number of different things in your data. Perhaps the field was not applicable, the event did not happen, or the data was not available. It could be that the person who entered the data did not know the right value, or did not care if a field was not filled in. Therefore, Analysis Services provides two distinctly different mechanisms for managing and calculating these missing values, also known as null values.
If the task that you are modeling specifies that a column must never have missing values, you should use the NOT_NULL modeling flag when you define the mining structure. This will ensure that processing will fail if a case does not have an appropriate value. If an error occurs when processing a model, you can then log the error and take steps to correct the data that is supplied to the model. There are a variety of tools that you can use to infer and fill in appropriate values, such as the Lookup transformation or the Data Profiler task in SQL Server Integration Services, or the Fill By Example tool provided in the Data Mining Add-Ins for Excel.
However, there are also many data mining scenarios in which missing values provide important information. Generally, Analysis Services treats missing values as informative and adjusts the probabilities to incorporate the missing values into its calculations. By doing so, you can ensure that models are balanced and do not weight existing cases too heavily. This section explains how values are defined and counted as Missing in models that permit null values. This topic also describes how data mining algorithms process and use these Missing values when creating a model.
Each algorithm, including custom algorithms that you may have obtained from a third-party plug-in, can handle Missing values differently.
To the data mining algorithm, missing values are informative. In case tables, Missing is a valid state like any other. Moreover, a data mining model can use other values to predict whether a value is missing. In other words, the fact that a value is missing is not treated as an error.
When you create a data mining model, a Missing state is automatically added to the model for all discrete columns. For example, if the input column for Gender contains two possible values, Male and Female, a third value is automatically added to represent the Missing value, and the histogram that shows the distribution of all values for the column always includes a count of the cases with Missing values. If the Gender column is not missing any values, the histogram shows that the Missing state is found in 0 cases.
Including the Missing state by default makes sense when you consider that your data might not have examples of all possible values, and you would not want the model to exclude the possibility just because there was no example in the data. For example, if sales data for a store showed that all customers who purchased a certain product happened to be women, you would not want to create a model that predicts that only women could purchase the product. Instead, Analysis Services adds a placeholder for the extra unknown value, called Missing, as a way of accommodating possible other states.
For example, the following table shows the distribution of values for the (All) node in the decision tree model created for the Bike Buyer tutorial. In the example scenario, the [Bike Buyer] column is the predictable attribute, where 1 indicates "Yes" and 0 indicates "No".
This distribution shows that about half of the customers have purchased a bike, and half have not. This particular data set is very clean; therefore, every case has a value in the [Bike Buyer] column, and the count of Missing values is 0. However, if any case had a null in the [Bike Buyer] field, Analysis Services would count that row as a case with a Missing value.
If the input is a continuous column, the model tabulates two possible states for the attribute: Existing and Missing. In other words, either the column contains a value of some numeric data type, or it contains no value. For cases that have a value, the model calculates mean, standard deviation, and other meaningful statistics. For cases that have no value, the model provides a count of the Missing vales and adjusts predictions accordingly. The method for adjusting the prediction differs depending on the algorithm and is described in the following section.
For attributes in a nested table, missing values are not informative. For example, if a customer has not purchased a product, the nested Products table would not have a row corresponding to that product, and the mining model would not create an attribute for the missing product. However, if you are interested in customers who have not purchased certain products, you can create a model that is filtered on the non-existence of the products in the nested table, by using a NOT EXISTS statement in the model filter. For more information, see How to: Apply a Filter to a Mining Model.
In addition to counting values, Analysis Services calculates the probability of any value across the data set. The same is true for Missing values. For example, the following table shows the probabilities for the cases in the previous example:
It may seem odd that the probability of the Missing value is calculated as 0.03%, when the number of cases is 0. In fact, this behavior is by design, and represents an adjustment that lets the model handle unknown values gracefully.
In general, probability is calculated as the favorable cases divided by all possible cases. In this example, the algorithm computes the sum of the cases that meet a particular condition ([Bike Buyer] = 1, or [Bike Buyer] = 0), and divides that number by the total count of rows. However, to account for the Missing cases, 1 is added to the number of all possible cases. As a result, the probability for the unknown case is no longer zero, but a very small number, indicating that the state is merely improbable, not impossible.
The addition of the small Missing value does not change the outcome of the predictor; however, it enables better modeling in scenarios where the historical data does not include all possible outcomes.
Data mining providers differ in the way they handle missing values. For example, some providers assume that missing data in a nested column is sparse representation, but that missing data in a non-nested column is missing at random.
If you are certain that all outcomes are specified in your data and want to prevent probabilities from being adjusted, you should set the NOT_NULL modeling flag on the column in the mining structure.
Special Handling of Missing Values in Decision Tree Models
The Microsoft Decision Trees algorithm calculates probabilities for missing values differently than in other algorithms. Instead of just adding 1 to the total number of cases, the decision trees algorithm adjusts for Missing states by using a slightly different formula.
In a decision tree model, the probability of the missing state is calculated as follows:
StateProbability = (NodePriorProbability)* (StateSupport + 1) / (NodeSupport + TotalStates)
Moreover, in SQL Server 2008 Analysis Services, the Decision Trees algorithm provides an additional adjustment that helps the algorithm compensate for the presence of filters on the model, which may result in many states to be excluded during training.
In SQL Server 2008, if a state is present during training but just happens to have 0 support in a certain node, the standard adjustment is made. However, if a state is never encountered during training, the algorithm sets the probability to exactly 0. This adjustment applies not only to the Missing state, but also to states that exist in the training data but have 0 support as result of model filtering.
This additional adjustment results in the following formula:
StateProbability = 0.0 if that state has 0 support in the training set
ELSE StateProbability = (NodePriorProbability)* (StateSupport + 1) / (NodeSupport + TotalStatesWithNonZeroSupport)
The net effect of this adjustment is to maintain the stability of the tree.