Exploring the Clustering Model (Basic Data Mining Tutorial)
Applies To: SQL Server 2016 Preview
The Microsoft Clustering algorithm groups cases into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions.
The Microsoft Cluster Viewer provides the following tabs for use in exploring clustering mining models: Cluster Diagram, Cluster Profiles, Cluster Characteristics, and Cluster Discrimination.
The Cluster Diagram tab displays all the clusters that are in a mining model. The lines between the clusters represent "closeness" and are shaded based on how similar the clusters are. The actual color of each cluster represents the frequency of the variable and the state in the cluster.
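To make the idea of shading by frequency concrete, the following minimal Python sketch computes, for each cluster, the fraction of cases that have a chosen attribute state. It is an illustration only, not the viewer's own calculation: it assumes each case is assigned to exactly one cluster, and the function name `state_density`, the cluster names, and the toy cases are all hypothetical.

```python
from collections import defaultdict

def state_density(cases, variable, state):
    """Return {cluster: fraction of that cluster's cases where variable == state}."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for case in cases:
        cluster = case["cluster"]
        totals[cluster] += 1
        if case[variable] == state:
            matches[cluster] += 1
    return {cluster: matches[cluster] / totals[cluster] for cluster in totals}

# Hypothetical toy cases; the tutorial model is trained on far more data.
cases = [
    {"cluster": "Cluster 1", "Bike Buyer": 1},
    {"cluster": "Cluster 1", "Bike Buyer": 1},
    {"cluster": "Cluster 1", "Bike Buyer": 0},
    {"cluster": "Cluster 2", "Bike Buyer": 0},
    {"cluster": "Cluster 2", "Bike Buyer": 0},
    {"cluster": "Cluster 2", "Bike Buyer": 1},
    {"cluster": "Cluster 2", "Bike Buyer": 0},
]

density = state_density(cases, "Bike Buyer", 1)
# The cluster with the highest density would be drawn with the darkest shading.
darkest = max(density, key=density.get)
print(density, darkest)
```

In this toy data, the first cluster has the larger share of bike buyers, so it would be the one shaded darkest.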
To explore the model in the Cluster Diagram tab
Use the Mining Model list at the top of the Mining Model Viewer tab to switch to the TM_Clustering model.
In the Viewer list, select Microsoft Cluster Viewer.
In the Shading Variable box, select Bike Buyer.
The default variable is Population, but you can change this to any attribute in the model, to discover which clusters contain members that have the attributes you want.
Select 1 in the State box to explore those cases where a bike was purchased.
The Density legend describes the density of the attribute-state pair selected in the Shading Variable and State boxes. In this example, it shows that the cluster with the darkest shading has the highest percentage of bike buyers.
Pause your mouse over the cluster with the darkest shading.
A tooltip displays the percentage of cases that have the attribute, Bike Buyer = 1.
Select the cluster that has the highest density, right-click the cluster, select Rename Cluster and type Bike Buyers High for later identification. Click OK.
Find the cluster that has the lightest shading (and the lowest density). Right-click the cluster, select Rename Cluster and type Bike Buyers Low. Click OK.
Click the Bike Buyers High cluster and drag it to an area of the pane that will give you a clear view of its connections to the other clusters.
When you select a cluster, the lines that connect this cluster to other clusters are highlighted, so that you can easily see all the relationships for this cluster. When the cluster is not selected, you can tell by the darkness of the lines how strong the relationships are amongst all the clusters in the diagram. If the shading is light or nonexistent, the clusters are not very similar.
Use the slider to the left of the network to filter out the weaker links and find the clusters with the closest relationships. The Adventure Works Cycles marketing department might want to combine similar clusters when determining the best method for delivering the targeted mailing.
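The effect of the slider can be pictured as a threshold applied to link strengths. The short Python sketch below keeps only the links whose similarity meets a cutoff. The similarity scores, the names Cluster 3 and Cluster 5, and the function `filter_links` are hypothetical; the viewer does not expose its link strengths in this form.

```python
def filter_links(links, threshold):
    """Keep only the links whose similarity score is at least `threshold`."""
    return {pair: score for pair, score in links.items() if score >= threshold}

# Hypothetical similarity scores between pairs of clusters (0 = unrelated, 1 = identical).
links = {
    ("Bike Buyers High", "Cluster 3"): 0.82,
    ("Bike Buyers High", "Bike Buyers Low"): 0.15,
    ("Cluster 3", "Cluster 5"): 0.47,
}

# Raising the threshold plays the role of moving the slider to hide weaker links.
strong = filter_links(links, 0.4)
closest = max(links, key=links.get)
print(strong, closest)
```

With a cutoff of 0.4, the weak link between Bike Buyers High and Bike Buyers Low disappears, leaving only the pairs that might be candidates for combining.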
The Cluster Profiles tab provides an overall view of the TM_Clustering model. The tab contains a column for each cluster in the model. The first column lists the attributes that are associated with at least one cluster. The rest of the viewer shows the distribution of the states of each attribute for each cluster. The distribution of a discrete variable is shown as a colored bar, and the Histogram bars list sets the maximum number of bars that are displayed. Continuous attributes are displayed with a diamond chart, which represents the mean and standard deviation in each cluster.
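As a rough illustration of what one cell in a cluster profile summarizes, the Python sketch below computes the share of each state for a discrete attribute, limited to a maximum number of bars, and the mean and standard deviation for a continuous attribute. The sample values and function names are hypothetical, and the use of the population standard deviation is an assumption; the viewer's exact statistics may differ.

```python
from collections import Counter
from statistics import mean, pstdev

def discrete_profile(values, max_bars):
    """Return the `max_bars` most common states with their share of the cases."""
    counts = Counter(values)
    total = len(values)
    return [(state, count / total) for state, count in counts.most_common(max_bars)]

def continuous_profile(values):
    """Return (mean, population standard deviation) for a continuous attribute."""
    return mean(values), pstdev(values)

# Hypothetical values for the cases in a single cluster.
commute = ["0-1 Miles", "0-1 Miles", "1-2 Miles", "0-1 Miles", "2-5 Miles"]
ages = [30, 40, 50]

bars = discrete_profile(commute, max_bars=2)
age_mean, age_std = continuous_profile(ages)
print(bars, age_mean, age_std)
```

Here the `max_bars` argument stands in for the Histogram bars setting, and the mean and standard deviation correspond to what the diamond chart represents.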
To explore the model in the Cluster Profiles tab
Set Histogram bars to 5.
In our model, 5 is the maximum number of states for any one variable.
If the Mining Legend blocks the display of the Attribute profiles, move it out of the way.
Select the Bike Buyers High column and drag it to the right of the Population column.
Select the Bike Buyers Low column and drag it to the right of the Bike Buyers High column.
Click the Bike Buyers High column.
The Variables column is sorted in order of importance for that cluster. Scroll through the column and review the characteristics of the Bike Buyers High cluster. For example, these customers are more likely to have a short commute.
Double-click the Age cell in the Bike Buyers High column.
The Mining Legend displays a more detailed view and you can see the age range of these customers as well as the mean age.
Right-click the Bike Buyers Low column and select Hide Column.
With the Cluster Characteristics tab, you can examine in more detail the characteristics that make up a cluster. Instead of comparing the characteristics of all of the clusters (as in the Cluster Profiles tab), you can explore one cluster at a time. For example, if you select Bike Buyers High from the Cluster list, you can see the characteristics of the customers in this cluster. Though the display is different from the Cluster Profiles viewer, the findings are the same.
Unless you set an initial value for HoldoutSeed, results will vary each time you process the model. For more information, see HoldoutSeed Element.
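The general reason a seed matters can be shown with a generic Python sketch of a seeded holdout split. This is not Analysis Services code and makes no claim about how the product partitions data internally; the function `holdout_split` and its parameters are hypothetical.

```python
import random

def holdout_split(case_ids, holdout_fraction, seed=None):
    """Shuffle the case IDs and split them into (training, holdout) lists."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_fraction)
    return shuffled[cut:], shuffled[:cut]

ids = range(100)
train_a, test_a = holdout_split(ids, 0.3, seed=42)
train_b, test_b = holdout_split(ids, 0.3, seed=42)
# Same seed, same partition: both runs hold out exactly the same cases.
print(test_a == test_b)
```

Calling the function without a seed typically yields a different partition on each run, which is why a model trained on the remaining cases can change from one processing to the next.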
With the Cluster Discrimination tab, you can explore the characteristics that distinguish one cluster from another. After you select two clusters, one from the Cluster 1 list, and one from the Cluster 2 list, the viewer calculates the differences between the clusters and displays a list of the attributes that distinguish the clusters most.
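One simple way to think about this comparison is to rank attribute-state pairs by how much their share differs between the two clusters. The Python sketch below does that with a plain difference of proportions. This scoring is an assumption chosen for illustration, not the viewer's actual calculation, and the attribute names, states, and proportions are hypothetical.

```python
def discrimination(profile_a, profile_b):
    """Rank attribute-state pairs by the gap in their share between two clusters.

    A positive score favors the first cluster; a negative score favors the second.
    """
    keys = set(profile_a) | set(profile_b)
    scores = {key: profile_a.get(key, 0.0) - profile_b.get(key, 0.0) for key in keys}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical share of cases having each attribute state, per cluster.
high = {
    ("Number Cars Owned", 0): 0.50,
    ("Number Cars Owned", 2): 0.10,
    ("Region", "Pacific"): 0.40,
}
low = {
    ("Number Cars Owned", 0): 0.05,
    ("Number Cars Owned", 2): 0.60,
    ("Region", "Pacific"): 0.30,
}

ranked = discrimination(high, low)
for (attribute, state), score in ranked:
    favors = "Bike Buyers High" if score > 0 else "Bike Buyers Low"
    print(f"{attribute} = {state}: favors {favors}")
```

In this toy data, the car ownership states separate the two clusters far more than region does, so they appear at the top of the ranking.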
To explore the model in the Cluster Discrimination tab
In the Cluster 1 box, select Bike Buyers High.
In the Cluster 2 box, select Bike Buyers Low.
Click Variables to sort alphabetically.
Some of the more substantial differences among the customers in the Bike Buyers Low and Bike Buyers High clusters include age, car ownership, number of children, and region.
See the following topics to explore the other mining models.