Mining Model Content for Clustering Models (Analysis Services - Data Mining)
Topic Status: Some information in this topic is preview and subject to change in future releases. Preview information describes new features or changes to existing features in Microsoft SQL Server 2016 Community Technology Preview 2 (CTP2).
This topic describes mining model content that is specific to models that use the Microsoft Clustering algorithm. For a general explanation of mining model content for all model types, see Mining Model Content (Analysis Services - Data Mining).
A clustering model has a simple structure. Each model has a single parent node that represents the model and its metadata, and that parent node has a flat list of child nodes, one for each cluster (NODE_TYPE = 5). This organization is shown in the following image.
Each child node represents a single cluster and contains detailed statistics about the attributes of the cases in that cluster. These statistics include the number of cases in the cluster and the distribution of values that distinguishes the cluster from other clusters.
Note |
---|
You do not need to iterate through the nodes to get a count or description of the clusters; the model parent node also counts and lists the clusters. |
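The flat layout described above can be sketched in a few lines of Python. The rows below are hypothetical stand-ins for what a content query might return: the node names and child counts are invented for illustration, and a real model returns many more columns. NODE_TYPE = 5 for a cluster comes from this topic; NODE_TYPE = 1 for the model node is the value documented for mining model content in general.

```python
# Hypothetical content rows for a three-cluster model (illustrative values only).
MODEL_NODE = 1    # NODE_TYPE of the model (parent) node
CLUSTER_NODE = 5  # NODE_TYPE of a cluster node

content_rows = [
    {"NODE_UNIQUE_NAME": "000", "NODE_TYPE": 1,
     "PARENT_UNIQUE_NAME": None, "CHILDREN_CARDINALITY": 3},
    {"NODE_UNIQUE_NAME": "001", "NODE_TYPE": 5,
     "PARENT_UNIQUE_NAME": "000", "CHILDREN_CARDINALITY": 0},
    {"NODE_UNIQUE_NAME": "002", "NODE_TYPE": 5,
     "PARENT_UNIQUE_NAME": "000", "CHILDREN_CARDINALITY": 0},
    {"NODE_UNIQUE_NAME": "003", "NODE_TYPE": 5,
     "PARENT_UNIQUE_NAME": "000", "CHILDREN_CARDINALITY": 0},
]


def split_nodes(rows):
    """Return (parent_node, cluster_nodes) for a flat clustering model."""
    parents = [r for r in rows if r["NODE_TYPE"] == MODEL_NODE]
    if len(parents) != 1:
        raise ValueError("expected exactly one model node")
    clusters = [r for r in rows if r["NODE_TYPE"] == CLUSTER_NODE]
    return parents[0], clusters


parent, clusters = split_nodes(content_rows)

# The parent's own child count should agree with the number of cluster rows,
# so the count can be read from the parent without walking the children.
counts_agree = parent["CHILDREN_CARDINALITY"] == len(clusters)
print(parent["NODE_UNIQUE_NAME"], len(clusters), counts_agree)
```

Because the structure is only one level deep, a single filter on NODE_TYPE is enough to separate the model node from the clusters; no recursive traversal is needed.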
The parent node contains useful statistics that describe the actual distribution of all the training cases. These statistics are found in the nested table column, NODE_DISTRIBUTION. For example, the following table shows several rows from the NODE_DISTRIBUTION table that describe the distribution of customer demographics for the clustering model, TM_Clustering, that you create in the Basic Data Mining Tutorial:
ATTRIBUTE_NAME | ATTRIBUTE_VALUE | SUPPORT | PROBABILITY | VARIANCE | VALUE_TYPE |
---|---|---|---|---|---|
Age | Missing | 0 | 0 | 0 | 1 (Missing) |
Age | 44.9016152716593 | 12939 | 1 | 125.663453102554 | 3 (Continuous) |
Gender | Missing | 0 | 0 | 0 | 1 (Missing) |
Gender | F | 6350 | 0.490764355823479 | 0 | 4 (Discrete) |
Gender | M | 6589 | 0.509235644176521 | 0 | 4 (Discrete) |
From these results, you can see that 12,939 cases were used to build the model, that the ratio of males to females was about 50-50, and that the mean age was about 45 (44.9). The descriptive statistics vary depending on whether the attribute being reported is a continuous numeric data type, such as age, or a discrete value type, such as gender. For a continuous attribute, the row reports the mean (in the ATTRIBUTE_VALUE column) and the variance; for a discrete attribute, each state gets its own row with its support and probability.
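As a quick check on how the discrete rows relate to one another, the following Python sketch recomputes the Gender probabilities from the support counts in the sample table. It assumes that probability is simply support divided by the total number of cases, which is consistent with the sample values but is not stated by the source as the algorithm's definition.

```python
import math

# Support counts copied from the Gender rows of the sample table.
gender_support = {"F": 6350, "M": 6589}

# The state supports should add up to the number of training cases.
total_cases = sum(gender_support.values())

# Assumption: probability of a state = support of the state / total cases.
probabilities = {
    state: count / total_cases for state, count in gender_support.items()
}

for state, p in probabilities.items():
    print(f"{state}: support={gender_support[state]}, probability={p:.6f}")

# The probabilities of all states of one attribute should sum to 1.
print(math.isclose(sum(probabilities.values()), 1.0))
```

The total matches the support reported on the continuous Age row, which is how you can read the case count from either kind of attribute.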
Note |
---|
The variance represents the total variance for the cluster. When the value for variance is small, it indicates that most values in the column were fairly close to the mean. To obtain the standard deviation, calculate the square root of the variance. |
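To make the square-root step concrete, this short Python sketch converts the Age variance from the sample table into a standard deviation. The variable names are illustrative only.

```python
import math

# Values copied from the continuous Age row of the sample table.
age_mean = 44.9016152716593
age_variance = 125.663453102554

# Standard deviation is the square root of the variance.
age_std_dev = math.sqrt(age_variance)

print(f"mean age: {age_mean:.1f}")
print(f"standard deviation: {age_std_dev:.2f}")
```

For this model the result is roughly 11.2 years, so a typical case lies within about 11 years of the mean age of 44.9.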
Note that for each of the attributes there is a Missing value type that tells you how many cases had no data for that attribute. Missing data can be significant and affects calculations in different ways, depending on the data type. For more information, see Missing Values (Analysis Services - Data Mining).
This section provides detail and examples only for those columns in the mining model content that are relevant for clustering models.
For information about the general-purpose columns in the schema rowset, such as MODEL_CATALOG and MODEL_NAME, see Mining Model Content (Analysis Services - Data Mining).
Analysis Services provides multiple methods for creating a clustering model. If you do not know which method was used to create the model that you are working with, you can retrieve the model metadata programmatically, by using an ADOMD client or AMO, or by querying the data mining schema rowset. For more information, see Query the Parameters Used to Create a Mining Model.
Note |
---|
The structure and content of the model stay the same, regardless of which clustering method or parameters you use. |