Querying a Clustering Model (Analysis Services - Data Mining)
When you create a query against a data mining model, you can retrieve metadata about the model, or create a content query that provides details about the patterns discovered in analysis. Alternatively, you can create a prediction query, which uses the patterns in the model to make predictions for new data. Each type of query will provide different information. For example, a content query might provide additional details about the clusters that were found, whereas a prediction query might tell you to which cluster a new data point is most likely to belong.
This section explains how to create queries for models that are based on the Microsoft Clustering algorithm.
All mining models expose the content learned by the algorithm according to a standardized schema, the mining model schema rowset. You can create queries against the mining model schema rowset by using Data Mining Extension (DMX) statements. In SQL Server 2008, you can also query the schema rowsets directly as system tables.
Sample Query 1: Getting Model Metadata by Using DMX
The following query returns basic metadata about the clustering model, TM_Clustering, that you created in the Basic Data Mining Tutorial. The metadata available in the parent node of a clustering model includes the name of the model, the database where the model is stored, and the number of child nodes in the model. This query uses a DMX content query to retrieve the metadata from the parent node of the model:
SELECT MODEL_CATALOG, MODEL_NAME, NODE_CAPTION, NODE_SUPPORT, [CHILDREN_CARDINALITY], NODE_DESCRIPTION FROM TM_Clustering.CONTENT WHERE NODE_TYPE = 1
You must enclose the name of the column, CHILDREN_CARDINALITY, in brackets to distinguish it from the Multidimensional Expressions (MDX) reserved keyword of the same name.
Adventure Works DW
For a definition of what these columns mean in an clustering model, see Mining Model Content for Clustering Models (Analysis Services - Data Mining).
Sample Query 2: Retrieving Model Metadata from the Schema Rowset
By querying the data mining schema rowset, you can find the same information that is returned in a DMX content query. However, the schema rowset provides some additional columns. These include the parameters that were used when the model was created, the date and time that the model was last processed, and the owner of the model.
The following example returns the date the model was created, modified, and last processed, together with the clustering parameters that were used to build the model, and the size of the training set. This information can be useful for documenting the model, or for determining which of the clustering options were used to create an existing model.
SELECT MODEL_NAME, DATE_CREATED, LAST_PROCESSED, PREDICTION_ENTITY, MINING_PARAMETERS from $system.DMSCHEMA_MINING_MODELS WHERE MODEL_NAME = 'TM_Clustering'
10/12/2007 7:42:51 PM
10/12/2007 8:09:54 PM
The most useful content queries on clustering models generally return the same type of information that you can browse by using the Cluster Viewer. This includes cluster profiles, cluster characteristics, and cluster discrimination. This section provides examples of queries that retrieve this information.
Sample Query 3: Returning a Cluster or List of Clusters
Because all clusters have a node type of 5, you can easily retrieve a list of the clusters by querying the model content for only the nodes of that type. You can also filter the nodes that are returned by probability or by support, as shown in this example.
SELECT NODE_NAME, NODE_CAPTION ,NODE_SUPPORT, NODE_DESCRIPTION FROM TM_Clustering.CONTENT WHERE NODE_TYPE = 5 AND NODE_SUPPORT > 1000
English Education=Graduate Degree , 32 <=Age <=48 , Number Cars Owned=0 , 35964.0771121808 <=Yearly Income <=97407.7163393957 , English Occupation=Professional , Commute Distance=2-5 Miles , Region=North America , Bike Buyer=1 , Number Children At Home=0 , Number Cars Owned=1 , Commute Distance=0-1 Miles , English Education=Bachelors , Total Children=1 , Number Children At Home=2 , English Occupation=Skilled Manual , Marital Status=S , Total Children=0 , House Owner Flag=0 , Gender=F , Total Children=2 , Region=Pacific
The attributes that define the cluster can be found in two columns in the data mining schema rowset.
The NODE_DESCRIPTION column contains a comma-separated list of attributes. Note that the list of attributes might be abbreviated for display purposes.
The nested table in the NODE_DISTRIBUTION column contains the full list of attributes for the cluster. If your client does not support hierarchical rowsets, you can return the nested table by adding the FLATTENED keyword before the SELECT column list. For more information about the use of the FLATTENED keyword, see SELECT FROM <model>.CONTENT (DMX).
Sample Query 4: Returning Attributes for a Cluster
For every cluster, the Cluster Viewer displays a profile that lists the attributes and their values. The viewer also displays a histogram that shows the distribution of values for the whole population of cases in the model. If you are browsing the model in the viewer, you can easily copy the histogram from the Mining Legend and then paste it to Excel or a Word document. You can also use the Cluster Characteristics pane of the viewer to graphically compare the attributes of different clusters.
However, if you must obtain values for more than one cluster at a time, it is easier to query the model. For example, when you browse the model, you might notice that the top two clusters differ with regard to one attribute, Number Cars Owned. Therefore, you want to extract the values for each cluster.
SELECT TOP 2 NODE_NAME, (SELECT ATTRIBUTE_VALUE, [PROBABILITY] FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_NAME = 'Number Cars Owned') AS t FROM [TM_Clustering].CONTENT WHERE NODE_TYPE = 5
The first line of the code specifies that you want only the top two clusters.
By default, the clusters are ordered by support. Therefore, the NODE_SUPPORT column can be omitted.
The second line of the code adds a sub-select statement that returns only certain columns from the nested table column. Furthermore, it restricts the rows from the nested table to those related to the target attribute, Number Cars Owned. To simplify the display, the nested table is aliased.
The nested table column, PROBABILITY, must be enclosed in brackets because it is also the name of a reserved MDX keyword.
Sample Query 5: Returning a Cluster Profile Using System Stored Procedures
As a shortcut, rather than writing your own queries by using DMX, you can also call the system stored procedures that Analysis Services uses to work with clusters. The following example illustrates how to use the internal stored procedures to return the profile for a cluster with the ID of 002.
CALL System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterProfiles("TM_Clustering", '002',0.0005)
Similarly, you can use a system stored procedure to return the characteristics of a specific cluster, as shown in the following example:
CALL System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics("TM_Clustering", '009',0.0005)
Number Children at Home
The data mining system stored procedures are for internal use and Microsoft reserves the right to change them as needed. For production use, we recommend that you create queries by using DMX, AMO, or XMLA.
Sample Query 6: Finding Discriminating Factors for a Cluster
The Cluster Discrimination tab of the Cluster Viewer enables you to easily compare a cluster with another cluster, or compare a cluster with all remaining cases (the complement of the cluster).
However, creating queries to return this information can be complex, and you might need some additional processing on the client to store the temporary results and compare the results of two or more queries. As a shortcut, you can use the system stored procedures.
The following query returns a single table that indicates the primary discriminating factors between the two clusters that have the node IDs of 009 and 007. Attributes with positive values favor cluster 009, whereas attributes with negative values favor cluster 007.
This is the same information that is presented in the chart of the Cluster Discrimination viewer if you select Cluster 9 from the first drop-down list and Cluster 7 from the second drop-down list. To compare Cluster 9 with its complement, you use the empty string in the second parameter, as shown in the following example:
The data mining system stored procedures are for internal use and Microsoft reserves the right to change the arguments or behavior as needed. For production use, we recommend that you create queries by using DMX, AMO, or XMLA.
Sample Query 7: Returning Cases that Belong to a Cluster
If drillthrough has been enabled on the mining model, you can create queries that return detailed information about the cases used in the model. Moreover, if drillthrough has been enabled on the mining structure, you can include columns from the underlying structure by using the StructureColumn (DMX) function.
The following example returns two columns that were used in the model, Age and Region, and one more column, First Name, that was not used in the model. The query returns only cases that were classified into Cluster 1.
SELECT [Age], [Region], StructureColumn('First Name') FROM [TM_Clustering].CASES WHERE IsInNode('001')
To return the cases that belong to a cluster, you must know the ID of the cluster. You can obtain the ID of the cluster by browsing the model in one of the viewers. Or, you can rename a cluster for easier reference, after which you could use the name in place of an ID number. However, know that the names that you assign to a cluster will be lost if the model is reprocessed.
Although clustering is typically used for describing and understanding data, the Microsoft implementation also lets you make prediction about cluster membership, and return probabilities associated with the prediction. This section provides examples of how to create prediction queries on clustering models. You can make predictions for multiple cases, by specifying a tabular data source, or you can provide new values on at a time by creating a singleton query. For clarity the examples in this section are all singleton queries.
For more information about how to create prediction queries using DMX, see Creating DMX Prediction Queries.
For information about how to build prediction queries by using the graphical tools in BI Development Studio, see Mining Model Prediction Tab: How-to Topics.
Sample Query 8: Predicting Outcomes from a Clustering Model
If the clustering model you create contains a predictable attribute, you can use the model to make predictions about outcomes. However, the model handles the predictable attribute differently depending on whether you set the predictable column to Predict or PredictOnly. If you set the usage of the column to Predict, the values for that attribute are added to the clustering model and appear as attributes in the finished model. However, if you set the usage of the column to PredictOnly, the values are not used to create clusters. Instead, after the mode is completed, the clustering algorithm creates new values for the PredictOnly attribute based on the clusters to which each case belongs.
The following query provides a single new case to the model, where the only information about the case is the age and gender. The SELECT statement specifies the predictable attribute/value pair that you are interested in, and the PredictProbability (DMX) function tells you the probability that a case with those attributes will have the targeted outcome.
SELECT [TM_Clustering].[Bike Buyer], PredictProbability([Bike Buyer],1) FROM [TM_Clustering] NATURAL PREDICTION JOIN (SELECT 40 AS [Age], 'F' AS [Gender]) AS t
Example of results when usage is set to Predict:
Example of results when the usage is set to PredictOnly and the model is reprocessed:
In this example, the difference in the model is not significant. However, sometimes it can be important to detect differences between the actual distribution of values and what the model predicts. The PredictCaseLikelihood (DMX) function is useful in this scenario, because it tells you how likely a case is, given the model.
The number that is returned by the PredictCaseLikelihood function is a probability, and therefore is always between 0 and 1, with a value of 0.5 representing a random outcome. Therefore, a score less than 0.5 means that the predicted case is unlikely, given the model, and a score over 0.5 indicates that the predicted case is more likely than the random case.
For example, the following query returns two values that characterize the likelihood of a new sample case. The non-normalized value represents the probability given the current model. When you use the NORMALIZED keyword, the likelihood score that is returned by the function is adjusted by dividing "probability with the model" by "probability without the model".
SELECT PredictCaseLikelihood(NORMALIZED) AS [NormalizedValue], PredictCaseLikelihood(NONNORMALIZED) AS [NonNormalizedValue] FROM [TM_Clustering_PredictOnly] NATURAL PREDICTION JOIN (SELECT 40 AS [Age], 'F' AS [Gender]) AS t
Note that the numbers in these results are expressed in scientific notation.
Sample Query 9: Determining Cluster Membership
This example uses the Cluster (DMX) function to return the cluster to which the new case is most likely to belong, and uses the ClusterProbability (DMX) function to return the probability for membership in that cluster.
SELECT Cluster(), ClusterProbability() FROM [TM_Clustering] NATURAL PREDICTION JOIN (SELECT 40 AS [Age], 'F' AS [Gender], 'S' AS [Marital Status]) AS t
Note By default, the ClusterProbability function returns the probability of the most likely cluster. However, you can specify a different cluster by using the syntax ClusterProbability('cluster name'). If you do this, be aware that the results from each prediction function are independent of the other results. Therefore, the probability score in the second column could refer to a different cluster than the cluster named in the first column.
Sample Query 10: Returning All Possible Clusters with Probability and Distance
In the previous example, the probability score was not very high. To determine if there is a better cluster, you can use the PredictHistogram (DMX) function together with the Cluster (DMX) function to return a nested table that includes all possible clusters, together with the probability that the new case that belongs to each cluster. The FLATTENED keyword is used to change the hierarchical rowset into a flat table for easier viewing.
SELECT FLATTENED PredictHistogram(Cluster()) FROM [TM_Clustering] NATURAL PREDICTION JOIN (SELECT 40 AS [Age], 'F' AS [Gender], 'S' AS [Marital Status]) AS t
By default, the results are ranked by probability. The results tell you that, even though the probability for Cluster 2 is fairly low, Cluster 2 is still the best fit for the new data point.
Note The additional column, $DISTANCE, represents the distance from the data point to the cluster. By default, the Microsoft Clustering Algorithm uses scalable EM clustering, which assigns multiple clusters to each data point and ranks the possible clusters. However, if you create your clustering model using the K-means algorithm, only one cluster can be assigned to each data point, and this query would return only one row. Understanding these differences is necessary to interpret the results of the PredictCaseLikelihood (DMX) function. For more information about the differences between EM and K-means clustering, see Microsoft Clustering Algorithm Technical Reference.
All Microsoft algorithms support a common set of functions. However, models that are built by using the Microsoft Clustering algorithm support the additional functions that are listed in the following table.
For a list of the functions that are common to all Microsoft model types, see Algorithm Reference (Analysis Services - Data Mining). For the syntax of specific functions, see Data Mining Extensions (DMX) Function Reference.