Classification Matrix (Analysis Services - Data Mining)
A classification matrix sorts all cases from the model into categories, by determining whether the predicted value matched the actual value. All the cases in each category are then counted, and the totals are displayed in the matrix. The classification matrix is a standard tool for evaluation of statistical models and is sometimes referred to as a confusion matrix.
The chart that is created when you choose the Classification Matrix option compares actual to predicted values for each predicted state that you specify. The rows in the matrix represent the predicted values for the model, whereas the columns represent the actual values. The categories used in the analysis are false positive, true positive, false negative, and true negative.
A classification matrix is an important tool for assessing the results of prediction because it makes it easy to understand and account for the effects of wrong predictions. By viewing the counts and percentages in each cell of this matrix, you can quickly see how often the model predicted accurately.
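The counting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Analysis Services builds the chart: the function name `classification_matrix` and the toy data are hypothetical. Rows are predicted states and columns are actual states, as in the matrix described here.

```python
from collections import Counter

def classification_matrix(actual, predicted):
    """Count (predicted, actual) pairs.

    Returns the sorted list of states and a matrix whose rows are
    predicted states and whose columns are actual states.
    """
    states = sorted(set(actual) | set(predicted))
    counts = Counter(zip(predicted, actual))
    return states, [[counts[(p, a)] for a in states] for p in states]

# Hypothetical toy data (not from the tutorial): 0 = No, 1 = Yes.
actual    = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 0, 1, 0, 1, 0]

states, matrix = classification_matrix(actual, predicted)
for state, row in zip(states, matrix):
    print(f"predicted {state}: {row}")
```

Each cell holds the number of cases with that combination of predicted and actual value, so the cells on the main diagonal are the cases where the prediction matched.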
This section explains how to create a classification matrix and how to interpret the results.
Consider the model that you created as part of the Basic Data Mining Tutorial. The [TM_DecisionTree] model is used to help create a targeted mailing campaign, and can be used to predict which customers are most likely to buy a bike. To test the expected usefulness of this model, you use a data set for which the values of the outcome attribute, [Bike Buyer], are already known. Typically, you would use the testing data set that you set aside when you created the mining structure that is used for training the model.
There are only two possible outcomes: yes (the customer is likely to buy a bike), and no (the customer will likely not purchase a bike). Therefore, the resulting classification matrix is relatively simple.
The following table shows the classification matrix for the TM_DecisionTree model. Remember that for this predictable attribute, 0 means No and 1 means Yes.
Predicted    0 (Actual)    1 (Actual)
0            362           144
1            121           373
The first result cell, which contains the value 362, indicates the number of true positives for the value 0. Because 0 indicates that the customer did not purchase a bike, this statistic tells you that the model predicted the correct value for non-bike buyers in 362 cases.
The cell directly underneath that one, which contains the value 121, tells you the number of false positives for the value 1, or how many times the model predicted that someone would buy a bike when actually they did not.
The cell that contains the value 144 indicates the number of false negatives for the value 1. Because 1 means that the customer did purchase a bike, this statistic tells you that in 144 cases, the model predicted someone would not buy a bike when in fact they did.
Finally, the cell that contains the value 373 indicates the number of true positives for the target value of 1. In other words, in 373 cases the model correctly predicted that someone would buy a bike.
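As a rough sketch of how the four cells map onto the four categories, the following Python snippet labels each count relative to a chosen target state. The counts are the ones discussed above. The helper name `categorize` is hypothetical and is not part of any Analysis Services API.

```python
# Counts from the TM_DecisionTree walkthrough, keyed by (predicted, actual).
matrix = {
    (0, 0): 362,  # predicted No,  actually No
    (0, 1): 144,  # predicted No,  actually Yes
    (1, 0): 121,  # predicted Yes, actually No
    (1, 1): 373,  # predicted Yes, actually Yes
}

def categorize(matrix, target):
    """Sum the cells into TP/FP/FN/TN relative to the given target state."""
    result = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for (predicted, actual), count in matrix.items():
        if predicted == target:
            key = "TP" if actual == target else "FP"
        else:
            key = "FN" if actual == target else "TN"
        result[key] += count
    return result

print(categorize(matrix, target=1))
# {'TP': 373, 'FP': 121, 'FN': 144, 'TN': 362}
```

Note that the labels depend on the target state: with `target=0`, the same 362 cases count as true positives and the 373 cases as true negatives.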
By summing the values along each diagonal, you can determine the overall accuracy of the model. The diagonal that runs from the upper left to the lower right gives the total number of accurate predictions (362 + 373 = 735), and the other diagonal gives the total number of erroneous predictions (121 + 144 = 265).
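The diagonal sums can be checked with a short calculation. The sketch below uses the four counts from the walkthrough, with rows as predicted values and columns as actual values. The variable names are illustrative only.

```python
# Rows are predicted values (0, 1); columns are actual values (0, 1).
matrix = [
    [362, 144],
    [121, 373],
]

# Correct predictions sit on the main diagonal (predicted == actual).
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
wrong = total - correct
accuracy = correct / total

print(f"correct={correct}, wrong={wrong}, accuracy={accuracy:.1%}")
# correct=735, wrong=265, accuracy=73.5%
```

The same calculation works unchanged for a matrix with more than two states, because it relies only on the main diagonal and the grand total.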
The [Bike Buyer] case is especially easy to interpret because there are only two possible values. When the predictable attribute has multiple possible values, the classification matrix adds a new column for each possible actual value and then counts the number of matches for each predicted value. The following table shows the results on a different model, where three values (0, 1, 2) are possible.
Although the addition of more columns makes the report look more complex, the additional detail can be very useful when you want to assess the cumulative cost of making the wrong prediction. To create sums on the diagonals or to compare the results for different combinations of rows, you can click the Copy button provided in the Classification Matrix tab and paste the report into Excel. Alternatively, you can use a client such as the Data Mining Client for Excel, which supports SQL Server 2005 and later versions, to create a classification report directly in Excel that includes both counts and percentages. For more information, see SQL Server Data Mining.
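One way to assess the cumulative cost of wrong predictions, once the counts are outside the designer, is to weight each cell by a cost. The following Python sketch assumes a three-state attribute. Both the counts and the costs are hypothetical values invented for illustration and do not come from any model described here.

```python
# Hypothetical counts for states 0, 1, 2 (rows = predicted, columns = actual).
counts = [
    [50,  5,  2],
    [ 4, 40,  6],
    [ 1,  3, 30],
]

# Hypothetical cost of each (predicted, actual) combination.
# Correct predictions on the main diagonal cost nothing.
costs = [
    [0, 1, 5],
    [1, 0, 2],
    [5, 2, 0],
]

def total_cost(counts, costs):
    """Multiply each cell count by its cost and sum over the whole matrix."""
    return sum(
        n * c
        for count_row, cost_row in zip(counts, costs)
        for n, c in zip(count_row, cost_row)
    )

print(total_cost(counts, costs))  # 42
```

Assigning a higher cost to some off-diagonal cells than to others reflects the case where certain mistakes are more expensive than others, which a plain error count would hide.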
A classification matrix can be used only with discrete predictable attributes.
Although you can add multiple models when selecting models on the Input Selection tab of the Mining Accuracy Chart designer, the Classification Matrix tab will display a separate matrix for each model.
The following topics contain more information about how you can build and use classification matrices and other charts.
Provides a walkthrough of how to create a lift chart for the Targeted Mailing model.
Explains related chart types.
Describes uses of cross-validation for mining models and mining structures.
Describes steps for creating lift charts and other accuracy charts.