ClusterDistance (DMX)

The ClusterDistance function returns the distance of the input case from the specified cluster, or if no cluster is specified, the distance of the input case from the most likely cluster.

Syntax

ClusterDistance([<ClusterID expression>])

Applies To

This function can be used only if the underlying data mining model supports clustering. The function can be used with any kind of clustering model (EM, K-Means, etc.), but the results differ depending on the algorithm.

Return Type

A scalar value.

Remarks

The ClusterDistance function returns the distance between the input case and the cluster that has the highest probability for that input case.

In case of K-Means clustering, since any case can belong to only one cluster, with a membership weight of 1.0, the cluster distance is always 0. However, in K-Means, each cluster is assumed to have a centroid. You can obtain the value of the centroid by querying or browsing the NODE_DISTRIBUTION nested table in the mining model content. For more information, see Mining Model Content for Clustering Models (Analysis Services - Data Mining).

In the case of the default EM clustering method, all the points inside the cluster are considered equally likely; therefore, by design there is no centroid for the cluster. The value of ClusterDistance between a particular case and a particular cluster N is calculated as follows:

ClusterDistance(N) =1–(membershipWeight(N))

Or:

ClusterDistance(N) =1–ClusterProbability (N))

Analysis Services provides the following additional functions for querying clustering models:

  • Use the Cluster (DMX) function to return the most likely cluster.

  • Use the ClusterProbability (DMX) function to get the probability that a case belongs to a particular cluster. This value serves as the inverse of the cluster distance.

  • Use the PredictHistogram (DMX) function to return a histogram of the likelihood of the input case existing in each of the model’s clusters.

  • Use the PredictCaseLikelihood (DMX) function to return a measure from 0 to 1 that indicates how likely an input case is to exist considering the model learned by the algorithm.

Example1: Obtaining Cluster Distance to the Most Likely Cluster

The following example returns the distance from the specified case to the cluster that the case most likely belongs to.

SELECT
    ClusterDistance()
FROM
    [TM Clustering]
NATURAL PREDICTION JOIN
(SELECT 28 AS [Age],
    '2-5 Miles' AS [Commute Distance],
    'Graduate Degree' AS [Education],
    0 AS [Number Cars Owned],
    0 AS [Number Children At Home]) AS t

Example results:

Expression

0.0477390930705145

To find out which cluster this is, you can substitute Cluster for ClusterDistance in the preceding sample.

Example results:

$CLUSTER

Cluster 6

Example2: Obtaining Distance to a Specified Cluster

The following syntax uses the mining model content schema rowset to return the list of node IDs and node captions for the clusters in the mining model. You can then use the node caption as the cluster identifier argument in the ClusterDistance function.

SELECT NODE_UNIQUE_NAME, NODE_CAPTION 
FROM <model>.CONTENT 
WHERE NODE_TYPE = 5

Example results:

NODE_UNIQUE_NAME

NODE_CAPTION

001

Cluster 1

002

Cluster 2

The following syntax example returns the distance of the specified case from the cluster labeled Cluster 2.

SELECT
    ClusterDistance('Cluster 2')
AS [Cluster 2 Distance]
FROM [TM Clustering]
NATURAL PREDICTION JOIN
(SELECT 28 AS [Age],
    '2-5 Miles' AS [Commute Distance],
    'Graduate Degree' AS [Education],
    0 AS [Number Cars Owned],
    0 AS [Number Children At Home]) AS t

Example results:

Cluster 2 Distance

0.97008209236394