Classification Matrix (Analysis Services - Data Mining)
The Classification Matrix tab, on the Mining Accuracy Chart tab of Data Mining Designer, displays a matrix for each model that you specify on the Input Selection tab. This matrix, which is sometimes referred to as a confusion matrix, lets you quickly see how often the model predicted accurately.
The rows of each matrix represent the values predicted by the model, whereas the columns represent the actual values. The classification matrix is created by sorting all cases into categories according to the predicted value and whether that prediction matched the actual value. These categories are sometimes referred to as true positive, false positive, true negative, and false negative. All the cases in each category are then counted, and the totals are displayed in the matrix.
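The counting step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Analysis Services builds the matrix internally; the function name classification_matrix and the sample cases are hypothetical.

```python
from collections import Counter

def classification_matrix(cases, values):
    """Count test cases by (predicted, actual) pair.

    cases  -- iterable of (predicted, actual) tuples
    values -- ordered list of the discrete states of the predictable attribute
    Returns a list of rows: rows are predicted values, columns are actual values.
    """
    counts = Counter(cases)
    return [[counts[(predicted, actual)] for actual in values]
            for predicted in values]

# Hypothetical test cases, each written as (predicted, actual).
sample_cases = [(0, 0), (0, 0), (1, 0), (0, 1), (1, 1), (1, 1), (1, 1)]
values = [0, 1]
matrix = classification_matrix(sample_cases, values)

for predicted, row in zip(values, matrix):
    print(f"Predicted {predicted}: {row}")
```

Cells where the predicted value equals the actual value hold the correct predictions; every other cell holds one kind of error.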
This section explains how to create a classification matrix and how to interpret the results.
A classification matrix can be used only with discrete predictable attributes.
For example, consider the model that you created as part of the Basic Data Mining Tutorial. The TM_DecisionTree model, which is used to help create a targeted mailing campaign, can be used to predict which customers are most likely to buy a bike. If the customer is likely to buy a bike, the value of the [Bike Buyer] column is 1; if the customer is unlikely to buy a bike, the value of the [Bike Buyer] column is 0.
To assess whether the model is effective at making predictions, you test it against a data set for which the values of [Bike Buyer] are already known. Typically, you use a testing data set that you set aside when you created the mining structure that is used for training the model. Because this data already contains the actual results, you can quickly determine how many times the model predicted the expected value.
The following table shows the results when a classification matrix is created for the TM_DecisionTree model. Because there are only two possible values for this predictable attribute, 0 and 1, it is fairly easy to tell how often the model correctly makes a prediction.
Predicted    0 (Actual)    1 (Actual)
0            362           144
1            121           373
The first result cell, which contains the value 362, indicates the number of true positives for the value 0. Because 0 indicates that the customer did not purchase a bike, this statistic tells you that the model predicted the correct value for customers who did not buy a bike in 362 cases.
The cell directly underneath that one, which contains the value 121, tells you the number of false positives for the value 1, or how many times the model predicted that someone would buy a bike when actually they did not.
The cell that contains the value 144 indicates the number of false negatives for the value 1. Because 1 means that the customer did purchase a bike, this statistic tells you that in 144 cases, the model predicted someone would not buy a bike when in fact they did.
Finally, the cell that contains the value 373 indicates the number of true positives for the target value of 1. In other words, in 373 cases the model correctly predicted that someone would buy a bike.
By summing the values along each diagonal of the matrix, you can determine the overall accuracy of the model. The diagonal on which the predicted value equals the actual value (the cells containing 362 and 373) gives the total number of accurate predictions, and the other diagonal (the cells containing 121 and 144) gives the total number of erroneous predictions.
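As a worked example, the following Python sketch sums the diagonals using the four counts quoted in the preceding paragraphs. The variable names are illustrative only.

```python
# Counts from the TM_DecisionTree walkthrough; rows are predicted values,
# columns are actual values.
matrix = [
    [362, 144],  # predicted 0: actual 0, actual 1
    [121, 373],  # predicted 1: actual 0, actual 1
]

# Correct predictions lie on the diagonal where predicted == actual.
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
incorrect = total - correct
accuracy = correct / total

print(f"Correct: {correct}, incorrect: {incorrect}, accuracy: {accuracy:.1%}")
```

With these counts, 735 of the 1,000 test cases are predicted correctly and 265 are not, which gives an overall accuracy of 73.5 percent.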
Using Multiple Predictable Values
The [Bike Buyer] case is especially easy to interpret because there are only two possible values. When the predictable attribute has multiple possible values, the classification matrix adds a new column for each possible actual value and then counts the number of matches for each predicted value. The following table shows the results on a different model, where three values (0, 1, 2) are possible.
Although the addition of more columns makes the report look more complex, the additional detail can be very useful when you want to assess the cost of making the wrong prediction. To create sums on the diagonals or to compare the results for different combinations of rows, you can click the Copy button provided in the Classification Matrix tab and paste the report into Excel. Alternatively, you can use a client such as the Data Mining Client for Excel, which supports both SQL Server 2005 and SQL Server 2008, to create a classification report directly in Excel that includes both counts and percentages. For more information, see SQL Server Data Mining.
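The kind of follow-up analysis you might otherwise do in Excel can also be sketched in Python. The example below assumes a three-value attribute and uses made-up counts and a made-up cost table; neither comes from an actual mining model. It converts the counts to percentages and weights each kind of error by its assumed cost.

```python
# Hypothetical counts for a predictable attribute with three values (0, 1, 2);
# rows are predicted values, columns are actual values.
matrix = [
    [40, 5, 5],
    [6, 30, 4],
    [2, 3, 5],
]

# Hypothetical cost of predicting the row value when the column value is true.
# Correct predictions (the diagonal) cost nothing.
cost = [
    [0, 1, 5],
    [1, 0, 1],
    [5, 1, 0],
]

size = len(matrix)
total = sum(sum(row) for row in matrix)

# Share of all test cases that falls in each cell, as a percentage.
percentages = [[100 * count / total for count in row] for row in matrix]

# Total cost of all wrong predictions, weighted by the cost table.
total_cost = sum(matrix[p][a] * cost[p][a]
                 for p in range(size) for a in range(size))

for row in percentages:
    print("  ".join(f"{value:5.1f}%" for value in row))
print(f"Total cost of errors: {total_cost}")
```

Weighting the errors this way shows why the extra columns matter: two models with the same overall accuracy can have very different costs if one of them tends to make the more expensive kind of mistake.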
When you create a classification matrix, you follow these basic steps:
1. On the Mining Accuracy Chart tab of Data Mining Designer, click the Input Selection tab.
2. On the Input Selection tab, select a model to evaluate.
3. Specify the predictable attribute and, optionally, the predictable value.
4. Choose the data set to use for the evaluation.
5. Click the Classification Matrix tab to automatically generate a report in the classification matrix format.
For a step-by-step procedure that applies to all chart types, see How to: Create an Accuracy Chart for a Mining Model.
The Basic Data Mining Tutorial also includes a walkthrough of how to create a lift chart for the Targeted Mailing model. For more information, see Testing Accuracy with Lift Charts (Basic Data Mining Tutorial).