Updated: March 2, 2016
The ClusterDistance function returns the distance of the input case from the specified cluster, or if no cluster is specified, the distance of the input case from the most likely cluster.
This function can be used only if the underlying data mining model supports clustering. The function can be used with any kind of clustering model (EM, K-Means, etc.), but the results differ depending on the algorithm.
A scalar value.
The ClusterDistance function returns the distance between the input case and the cluster that has the highest probability for that input case.
In case of K-Means clustering, since any case can belong to only one cluster, with a membership weight of 1.0, the cluster distance is always 0. However, in K-Means, each cluster is assumed to have a centroid. You can obtain the value of the centroid by querying or browsing the NODE_DISTRIBUTION nested table in the mining model content. For more information, see Mining Model Content for Clustering Models (Analysis Services - Data Mining).
In the case of the default EM clustering method, all the points inside the cluster are considered equally likely; therefore, by design there is no centroid for the cluster. The value of ClusterDistance between a particular case and a particular cluster N is calculated as follows:
ClusterDistance(N) =1–ClusterProbability (N))
Analysis Services provides the following additional functions for querying clustering models:
Use the Cluster (DMX) function to return the most likely cluster.
Use the ClusterProbability (DMX) function to get the probability that a case belongs to a particular cluster. This value serves as the inverse of the cluster distance.
Use the PredictHistogram (DMX) function to return a histogram of the likelihood of the input case existing in each of the model’s clusters.
Use the PredictCaseLikelihood (DMX) function to return a measure from 0 to 1 that indicates how likely an input case is to exist considering the model learned by the algorithm.
The following example returns the distance from the specified case to the cluster that the case most likely belongs to.
SELECT ClusterDistance() FROM [TM Clustering] NATURAL PREDICTION JOIN (SELECT 28 AS [Age], '2-5 Miles' AS [Commute Distance], 'Graduate Degree' AS [Education], 0 AS [Number Cars Owned], 0 AS [Number Children At Home]) AS t
To find out which cluster this is, you can substitute
ClusterDistance in the preceding sample.
The following syntax uses the mining model content schema rowset to return the list of node IDs and node captions for the clusters in the mining model. You can then use the node caption as the cluster identifier argument in the ClusterDistance function.
SELECT NODE_UNIQUE_NAME, NODE_CAPTION FROM <model>.CONTENT WHERE NODE_TYPE = 5
The following syntax example returns the distance of the specified case from the cluster labeled Cluster 2.
SELECT ClusterDistance('Cluster 2') AS [Cluster 2 Distance] FROM [TM Clustering] NATURAL PREDICTION JOIN (SELECT 28 AS [Age], '2-5 Miles' AS [Commute Distance], 'Graduate Degree' AS [Education], 0 AS [Number Cars Owned], 0 AS [Number Children At Home]) AS t
|Cluster 2 Distance|