Performance Study of Microsoft Data Mining Algorithms
Sanjay Soni  UNISYS
Zhaohui Tang  Microsoft
Jim Yang – Microsoft
On This Page
Introduction
The Microsoft Data Mining Algorithms
Data Mining Terminology
TestSystem Configuration
Building Data Mining Models Using Analysis Services 2000
Performance Results for Microsoft Decision Trees Algorithm
Performance Study for Microsoft Clustering Algorithm
Conclusion
Acknowledgments
Introduction
In today's ebusiness environment, data mining is beginning to garner more and more attention. Because data mining is about exploration and analysis, by automatic or semiautomatic means, quantities of data can help to uncover meaningful patterns and rules. These patterns and rules help corporations improve their marketing, sales and customer support operations to better understand their customers. Over the years, corporations have accumulated very large databases from applications such as Enterprise Resource Planning (ERP), Client Relationship Management (CRM), or other operational systems. People believe there are untapped values hidden inside these data. Data mining techniques can help get these patterns out of this data.
Recently, Microsoft® has initiated the OLE DB for Data Mining application programming interface (API) with a number of leading data mining providers. This API defines a data mining query language based on SQL syntax. Data mining models are treated as a special type of relational table. Prediction operations are considered as a special kind of join. Microsoft SQL Server™ 2000 Analysis Services introduces the Microsoft data mining provider, which is based on the OLE DB for Data Mining standard. This provider includes two data mining algorithms: Microsoft Decision Trees and Microsoft Clustering, both patented by Microsoft Research.
This paper is organized in two parts. The first part of the paper provides a brief presentation of the decision tree and clustering algorithms, and explains a few basic data mining terms. Also, using a banking business scenario, this part shows how to use Analysis Services to build data mining models to solve a few business problems.
The second part of the paper focuses on the performance study of the two Microsoft data mining algorithms with synthetic data generated using a variety of parameters. Hundreds of experiments were conducted, and their results are presented in this part. The experiments were based on different parameter configurations of the training dataset. For example, the parameters varied for the number of input attributes, the sample size of training cases, the number of states (values) of each column, and so on. The results of these experiments prove that both data mining algorithms are very efficient and scalable.
The Microsoft Data Mining Algorithms
The two data mining algorithms shipped in SQL Server 2000, Microsoft Decision Trees (MDT) and Microsoft Clustering, are the result of many years of research work at Microsoft Research. This section provides a brief introduction of both algorithms. For more information about these algorithms, see http://research.microsoft.com.
Microsoft Decision Trees
The decision tree is probably the most popular technique for predictive modeling. An example explains some of the basics of the decision tree algorithm.
The following table shows a set of training data that could be used to predict credit risk. In this example, fictionalized information was generated on customers that included their debt level, income level, what type of employment they had and whether they were a good or bad credit risk.
Customer ID 
Debt level 
Income level 
Employment type 
Credit risk 

1 
High 
High 
Selfemployed 
Bad 
2 
High 
High 
Salaried 
Bad 
3 
High 
Low 
Salaried 
Bad 
4 
Low 
Low 
Salaried 
Good 
5 
Low 
Low 
Selfemployed 
Bad 
6 
Low 
High 
Selfemployed 
Good 
7 
Low 
High 
Salaried 
Good 
The following illustration shows a decision tree that might be created from this data.
In this example, the Decision Tree algorithm might determine that the most significant attribute for predicting credit risk is debt level. The first split in the decision tree is therefore made on debt level. One of the two new nodes (Debt = High) is a leaf node, containing three cases with bad credits and no cases with good credit. In this example, a high debt level is a perfect predictor of a bad credit risk. The other node (Debt = Low) is still mixed, having three good credits and one bad credit case.
The decision tree algorithm then chooses employment type as the next most significant predictor of credit risk. The split on employment type has two leaf nodes indicating that selfemployed people have a higher bad credit probability. This is, of course, a small example based on synthetic data, but it illustrates how the decision tree can use known attributes of the credit applicants to predict credit risk. In reality there are typically far more attributes for each credit applicant, and the numbers of applicants would be very large. When the scale of the problem expands, it is difficult for a person to manually extract the rules to identify good and bad credit risks. The classification algorithm can consider hundreds of attributes and millions of records to come up with the decision tree that describes rules for credit risk prediction.
There are many variations of algorithms that construct decision trees and that use different splitting methods: tree shapes, pruning techniques, and so on. Microsoft Decision Trees is a Probabilistic Classification Tree. It is very similar to C4.5, but instead of using Entropy as the splitting criteria, Microsoft Decision Trees uses a Bayesian score as the default.
Microsoft Clustering
The Microsoft Clustering algorithm is based on the Expectation and Maximization (EM) algorithm. This algorithm iterates between two steps. In the first step, called the E or "expectation" step, the cluster membership of each case is calculated. In the second step, called the M or "Maximization" step, the parameters of the models are reestimated using these cluster memberships. EM is similar to KMeans, which has the following major steps:

Assign initial means.

Assign cases to each mean using some distance measure.

Compute new means based on members of each cluster.

Assign new bin boundaries based on new means.

Cycle until convergence.
EM is different from KMeans in several aspects. The major difference is that EM has no strict boundary among clusters. A case is assigned to each cluster with a certain probability. The following illustration displays a few iterations of the EM algorithm for a onedimensional dataset. The data in each cluster is assumed to have a Gaussian distribution. The means of each cluster is shifted, subsequent to each iteration.
Most clustering algorithms must load all the data points into memory, which can cause serious scalability problems when processing a large dataset. The Microsoft Clustering algorithm uses a scalable framework, which selectively stores important portions of the database and summarizes other portions. The basic idea is to load data into memory buffers in chunks, and based on the updated data mining model, to summarize cases that are close together as Gaussian distribution, thereby compressing those cases. Microsoft Clustering algorithm needs to scan the raw data only once.
Data Mining Terminology
This section provides a brief introduction of some data mining terminology used in this paper. These terms are introduced in the Microsoft OLE DB for Data Mining specification, available for downloading at http://msdn2.microsoft.com/library/bb545450.aspx.
Data Mining Model
A data mining model is similar to a relational table. It contains a number of key columns, input columns, and predictable columns. A model is associated with a data mining algorithm. After the training stage, the data mining model stores patterns discovered by the data mining algorithm about the dataset. A data mining model can be thought of as a "truth table" containing rows for every possible combination of the distinct values for each column of the model. Once trained, a model can be used for prediction.
Columns
A data mining model column is similar to a column in a relational table, also called a "variable" or "attribute" in statistical terminology. There are three different types of columns in a data mining model: an input column, a predictable column, or a column that is both input and predictable. A data mining model uses the set of input attributes of the case to predict the output attributes. In this paper, columns and attributes are used interchangeably.
States
Each attribute may have a set of possible values. These values are called the state of the attribute.
Cases
Data mining is about analyzing cases. A case is the basic entity of information. A case can be simple, for example, when analyzing a customer loan risk, the customer information is a case. A case can be more complicated. For example, a data mining model might predict the list of products the customer will buy, based on customer demographics. The model shown in the following illustration is slightly more complicated than a simple case. This model combines demographic information with the list of products the customer buys. This type of case is called a nested case. Statistically speaking, the cases making up a dataset are assumed random and drawn from a fixed underlying distribution. In this paper, the term "sample size" is used to represent the number of cases.
Case Tables and Nested Tables
A case table is the table containing the case information related to the nonnested part of the data. A nested table is the table that contains information related to the nested part of the data. In the previous example, there are two input tables to the mining model. One table contains information about customer demographics. It is a case table. The other table contains information about customer purchases. It is a nested table. A nested table is similar to a transaction table in database terminology.
TestSystem Configuration
The hardware configuration used for the set of experiments described in this paper is:

One Unisys e@ction Aquanta ES5045R server running Microsoft Windows® 2000 Advanced Server and Microsoft SQL Server 2000 with Analysis Services 2000, which contains the relational database and the data mining models. This computer has the following configuration:

Four Intel Xeon 550 MHz CPUs

512MB cache

4GB RAM


Unisys OSM7700 Fibre channel data storage

RAID 5 disk arrays using five 9GB disks

Network: 100MB Ethernet
Building Data Mining Models Using Analysis Services 2000
This section describes a few business questions, which can be answered using data mining techniques, and explains how to build models to answer these questions using Analysis Services 2000. This section also shows how to train and browse data mining models and how to predict based on the trained models.
A Few Business Questions
Data mining finds hidden patterns inside datasets. These patterns can be used to solve many business problems. The following table presents a few business questions that are difficult to answer without data mining.
Question number 
Question (Data Mining Algorithm) 

1 
Identifying those customers that are most likely to churn (leave) based on customer demographical information. 
2 
Grouping heterogeneous customers into subgroups based on customer profile to generate a mailing list for marketing purposes. 
3 
Finding the list of other products that the customer may be interested in based on the products the customer has purchased. 
4 
Grouping customers into more or less homogeneous groups based on the customer profile and the list of banking products they have subscribed to. 
Data Source for the Data Mining Model
To answer the business questions presented in the previous topic ("A Few Business Questions"), the following relational database tables are used:

Customer table
This table contains demographic information about bank customers. Demographic information includes the customer's age, income, education level, house value, loan, and so on.

Purchases table
This table contains purchase information about the bank products that a customer has. This table has information about bank products such as checking accounts, money market savings, and so on.
As shown in the following illustration, the Customer table is linked to the Purchases table by the purchases that a customer makes. In relational terminology, the Purchases table makes a foreign key reference to the Customer table.
Creating Data Mining Models
When you create a data mining model (DMM), you define the structure and properties of the model. According to the Microsoft OLE DB for Data Mining API, a new DMM is created using the CREATE MINING MODEL statement. In relational databases, the CREATE TABLE statement defines the structure and properties of a relational table, including column names and data types, constraints, and storage options. Similarly, the CREATE MINING MODEL statement defines the keys, columns, and the specific algorithm and parameters to be used in the DMM training operation of a model.
You can create data mining models from Analysis Manager. The wizard automatically generates the CREATE MINING MODEL statements.
Creating a DMM Using Microsoft Decision Trees Without a Nested Table
To answer business Question 1, or identifying those customers that are most likely to churn (leave) based on customer demographical information, create a data mining model using the Microsoft Decision Trees algorithm. Choose CustomerID as the case key column. As shown in the following illustration, all the demographic information is selected as input columns, and the Churn_Yes_No column is selected as the output attribute to be predicted.
When you process this model named Model1_MDT_Nonnested, the data mining model is created, and the following CREATE statement, based on the OLE DB for Data Mining syntax, is generated and sent to the Microsoft Data Mining Provider transparently:
MiningModel 'Model1_MDT_NonNested' Execute : CREATE MINING MODEL [Model1_MDT_NonNested'S] ([Customer Id] LONG KEY, [Income] DOUBLE CONTINUOUS , [Other Income] DOUBLE CONTINUOUS , [Loan] DOUBLE CONTINUOUS , [Age] DOUBLE CONTINUOUS , [Region Name] TEXT DISCRETE , [Home Years] DOUBLE CONTINUOUS , [House Value] DOUBLE CONTINUOUS , [Education Level] TEXT DISCRETE , [Home Type] TEXT DISCRETE , [Churn Yes No] TEXT DISCRETE PREDICT) USING Microsoft_Decision_Trees
The syntax used to define the data mining model columns is similar to the CREATE TABLE statement of standard SQL. The key words LONG, DOUBLE and TEXT define the data types of the columns and are very similar to their SQL equivalents. There are a few extensions, however. The keyword KEY designates a Key contenttype column (or columns) that uniquely identifies a row in the data mining model. The keywords CONTINUOUS and DISCRETE are two of the possible values for Attribute contenttype columns. The keyword PREDICT designates the predictable column of the data mining model.
Processing (Training) Data Mining Models
After you create the data mining model, the next step is to train the model. Training the model means running the model against training data using a particular algorithm. Training is usually the most timeconsuming step. The algorithm may iterate over the training dataset a few times to find the hidden patterns. The OLE DB for Data Mining API hides the complexity of the model training from the consumer application. The API uses the INSERT statement to specify the training command. The syntax of this INSERT statement is the same as standard SQL, which populates a table with data. Although massive quantities of data are introduced to the data mining model, the model does not store any of the data but stores the patterns within the data instead. Once the model is trained, the client application can browse the model's content and perform queries on the new dataset.
The following illustration shows the training process. For information about the training performance, see the sections "Performance Results for Microsoft Decision Trees Algorithm" and "Performance Study for Microsoft Clustering Algorithm" later in this paper.
Browsing Mining Model Contents
Once the model is trained, you can browse the model contents in the Analysis Manager using the tree browser. In this browser, the content is displayed graphically—this enables navigating through different portions of the contents. Browsing the content can provide important insight into the data, allowing data analysts to understand the patterns and rules and later apply these rules to predict new datasets. The tree, shown in the following illustration, represents the pattern found by the Decision Trees algorithm over the training dataset.
As shown in the preceding tree, the Decision Tree algorithm finds that Income is the most significant attribute for predicting whether a customer will churn or not. The algorithm divides the Income attribute into four branches. The Decision Tree algorithm then chooses Age as the next most significant predictor. At the third level of prediction, the algorithm selects Education Level for those cases under $49,923.75 income, and House Value for those cases between $100,040.25 and $124,517.25. Statistics about credit distribution can be found on each node.
Prediction Using Mining Models
After the model is trained, you can use it to do predictions on new datasets.
In SQL Server 2000 Data Transformation Services (DTS), there is a new task called Prediction, which can be used in creating DTS Packages. The following illustration shows the Data Mining Prediction Query Task dialog box.
Based on the OLE DB for Data Mining API, prediction statements are SELECT statements that join a data mining model with a new input table. This special kind of join is called PREDICTION JOIN. The general syntax for prediction is:
SELECT [FLATTENED] <SELECTexpressions> FROM <mining model name> PREDICTION JOIN <source data query> ON <join condition> [WHERE <WHEREexpression>]
Note: In the preceding illustration, you can see the exact statement of prediction for this SELECT statement.
Building Mining Models Using the Clustering Algorithm with a Nested Table
To answer business Question 4, or grouping customers into more or less homogeneous groups based on the customer profile and the list of banking products they have subscribed to, you must build a clustering model using a nested table.
Create a clustered model using a nested table:

In the Analysis Manager tree pane, rightclick the Mining Models folder, and then click New Mining Model. Select the Clustering algorithm.

Select CustomerID as the case key column from the Customer table. All the demographic information is selected as input columns from the Customer table, and all the purchase information is selected from the Purchases table.

After the model is created, edit the model in the Mining Model Editor.
On the initial creation of the mining model, the wizard assumes that the Purchases table is a lookup table; therefore, the "many" side of the relationship is initially toward the Customer table. For this example, use Purchases as a nested table. Click the link between these tables in the Mining Model Editor and reverse the relationship.
The following illustration shows the edited nested model.
Once you process this data mining model, the trained model looks like the one in this illustration.
From this fictitious data, the following observations can be made:

Cluster 1
Most of the customers in this cluster are around 50 years of age, have an average income of $79,000 with no extra income source, and many of them stay in certain regions. The customers own many certificates of deposit (CDs) and money market savings accounts, and so on.

Cluster 2
The customers in this cluster average 40 years of age, possess an average income of $56,000 with an average additional income source of around $42,000, and many of them stay in certain other regions. Customers in this cluster own a maximum number of certificates of deposit (CDs) and savings accounts, but a minimum number of credit cards.

Cluster 3
The customers in this cluster average 65 years of age, have an average income of $56,347 with average additional income source of $42,645, and many of them stay in certain other regions. Customers in this cluster own a maximum number of credit cards and money market accounts.
Based on this information, the Marketing Department can send appropriate mailings for different products to different clusters of customers. The response rate from these mailings will be much higher than if a randomly picked portion of the customer base were used.
Performance Results for Microsoft Decision Trees Algorithm
The topics in this section present the results of a performance study on Microsoft Decision Trees using a variety of datasets. As described earlier in this paper, model training is the most resourceconsuming task for data mining. The experiments focus on the training performance. This section is organized in two parts. The first part covers Microsoft Decision Trees training performance without nested tables, and the second part shows the Decision Trees training performance when nested tables are involved.
The data is highly correlated generated data. Therefore, in the data and the trees created by the Decision Trees algorithm, there are many patterns that are larger than in reallife application.
In this study, six factors that can affect the scalability and the performance are considered:

Cardinality

Number of attributes

Number of states

Number of states in the nested key attribute

Sparseness of the table

Number of clusters for clustering analysis
In the approach to this study, one factor is changed, the rest of them are fixed each time, and the effect on training performance from the selected factor is studied.
Both algorithms are computation intensive. When the data structure is complicated, they consume a lot of memory for recoding. Therefore, a fast CPU and big memory will boost the performance. If you have multiple CPUs, you can save the process time by processing several mining models simultaneously.
Training Performance Study Without Nested Tables
In most situations, data analysts gather source data from an online transaction processing (OLTP) system. After data cleansing and data transformation steps, the training dataset is stored in a single case table. For example, for business Question 1 (Identifying those customers that are most likely to churn (leave) based on customer demographical information.), an analyst may want to predict customer churn risk based on customer demographic information. The input data in this case is a single case table: Customer.
Training Time vs. Number of Input Attributes
In this experiment, the sample size of training cases is fixed at 1 million. Each attribute has 25 different states. There is only one predictable attribute. The MDT training performance was studied by varying the number of input attributes from 10, 20, 50, 100, and 200. The following graph shows the training times for a varying number of input attributes.
This data shows that:

The Decision Tress algorithm scales linearly when the input attribute number increases.

The training performance is very fast. When there are 200 input attributes, it takes about 130 minutes for the Decision Trees algorithm to train 1 million cases.
The predictions for both algorithms are very fast. Because only one data loop is needed for prediction, the computation time is linear to sample size. For this test case, it takes 5 seconds to predict 100,000 cases. On average, it takes 0.05 milliseconds to predict 1 case.
Training Time vs. Sample Size
In this experiment, the sample size from 10,000 is increased to 10 million (10,000; 25,000; 50,000; 75,000; 100,000; 1 million; 10 million). The other dimensions remain fixed by using a case table with 20 attributes and 25 states per attribute. There is one predictable attribute. The following graph shows the training times for varying number of input cases.
This data shows that:

The training time of the Decision Trees algorithm scales linearly with the sample size.

It takes 20 seconds to train a model with 10,000 cases and 100 minutes for 20 million cases. The overall performance is very fast.
Training vs. Number of States of Attributes
In this experiment, the sample size is fixed to 1 million. There are 20 input attributes. The number of states is increased from 2, 5,10, 25 to 50.
The following graph shows the training times for varying number of states.
The data shows that:

The training time increases linearly when the number of states is less than 10.

When the number of states increases, it becomes difficult for the decision tree to find useful splits for the tree growth; thus, the tree depth is reduced and so is the training time. This phenomenon is an artifact of the test data. The data cleansing to recode the states is recommended. The number of states should not be more than 20.
Training Times vs. Number of Predictable Attributes
In this test, the number of predictable attributes is varied for the MDT model and the other variables remain fixed. There are 1 million cases with 40 attributes; each attribute has 25 states. Decision Trees builds a tree for each predictable attribute up to 255 trees. If there are more than 255 predictable attributes, the Decision Trees algorithm uses an attributeselection algorithm (feature selection) to pick the most important attributes for which to build trees.
The following graph shows the training performance as the number of predictable attributes is increased from 1, 2, 4, 16 to 32.
The data shows that:

The training time is slightly greater than linear according to the number of predictable attributes. It basically builds a tree for each predictable attribute, and the multiple trees building process is performed in a parallel way. The training time is slightly more than linear to the number of trees, possibly due to the overhead of switching between trees during the training process. Data cleansing to reduce the number of predictable columns is recommended.
Training Performance Study for MDT with Nested Tables
The nested table is a new concept introduced in OLE DB for Data Mining. Using nested tables makes modeling more powerful. For example, in business Question 3 (Finding the list of other products that the customer may be interested in based on the products the customer has purchased.), you can build a model with nested table to predict the list of banking products a customer may be interested in, based on his/her demographic information coupled with the list of banking products the customer has purchased before. Without the concept of the nested table, it is very difficult to analyze this data mining problem. For all experiments in this section, the built models are based on two tables: The main case table contains the customer demographic information, and the nested table contains the list of banking products that the customers have purchased.
Note: The states in a nested table correspond to attributes in the model.
Training Times vs. Number of States in a Nested Table
In this experiment, the number of states of the nested table key (the number of banking products) is varied. The main case table contains 200,000 customers (cases). Five input attributes from the main case table are used. Each customer randomly purchases from 0 to 50 different products, which are stored in the Products nested table. The nested table also has a column describing the quantity of each product purchased. The number of states of the nested table key is varied from 100 to 1000. As the customer purchase rate remains the same, the nested key (product) becomes sparse in the purchase table.
The following graph shows the training times for varying number of states in nested table.
This data shows that:

Training takes significantly more time compared to similar situations without a nested model.

When there are less than 200 nested table keys, the training time increases. After that (beyond 255 states), the training time starts to decrease. This decrease occurs because when the nested tablekey number is more than 255, the MDT algorithm uses feature selection techniques to pick the most important products, and the remaining products are predicted using a marginal model.

When the number of nested table key states is more than 255, as the customer purchase rate remains the same, the nested table key becomes sparser. Thus, there are few related patterns for each of the keys. The trees become smaller, and the training time decreases.
Training Times vs. Number of Products Purchased per Customer
In this experiment, the sample size is fixed to 200,000. The number of different products in the nested table remains at about 1000. The average number of products each customer buys is varied from 10, 25, 50 to 100.
The following graph shows the training times for varying number of products purchased per customer.
This data shows that:

Decision Trees scale linearly with the number of products purchased per customer. This is reasonable because training time is roughly linear with the number of input attributes.
Training Times vs. Sample Size of the Case Table
In this experiment, the average customer purchases is fixed at 25, the number of states in a nested table is fixed at 200, and the cardinality of the case table (more customers.) is increased The sample size of the case table varies from 10,000; 50,000; 100,000 to 200,000. As the key ratio (purchase rate) is fixed, the cardinality of the nested table increases, as well. The following graph shows the training times for varying the sample size of the case table.
The data shows that:

The training time increases linearly with the sample size. It takes about 250 minutes to train a model with 200,000 customers in the case table and 10 million rows in the nested tables.

It takes about 250 minutes to train a model with 200,000 customers in the case table and 10 million rows in the nested table. Note that without nested tables, the performance of training 10 million cases is only 100 minutes. It takes significantly more time for training a model with a nested table.
Performance Study for Microsoft Clustering Algorithm
The topics in this section describe the performance of the Microsoft Clustering algorithm across a variety of synthetic datasets. In particular, the sample size, the number of input attributes, the number of generated clusters, the variable content type (discrete or continuous), and the number of states of input variables are varied.
Performance Study Without Nested Tables
This section presents the performance study results when a nested table is not included as an input table.
Training Times vs. Number of Clusters
In this test, the sample size to is fixed 1 million; the number of attributes is fixed to 20, and the number of states per attribute is fixed to 20. The number of identifiable clusters is varied from 5, 10 to 20. Of the 20 input variables, 1 is predictable.
The following graph shows the training times for varying number of clusters.
The data shows that:

The training time is almost linear to the number of clusters. This result also means that the number of iterations required for the EM algorithm to converge is roughly independent of the number of clusters. EM works by iterations. For each iteration, EM calculates the likelihood function for each variable of input case on each cluster. Thus, the calculation time increases linearly with the number of clusters to be formed. Actually in the chart, the result seems slightly worse than linear. This is probably because the number of iterations may have increased slightly to find the best convergent point when number of clusters increased. For 1 million input cases, it takes around 90 minutes to form 5 clusters, 110 minutes to form 10 clusters, and 230 minutes to form 20 clusters.
Training Times vs. Number of Attributes
In this test, the sample size is fixed at 1 million, and the cluster number is fixed at 10. The number of attributes is increased for each case.
The following graph shows the training times for varying number of attributes.
The data shows that:

The training time is linear to the number of input attributes. The EM algorithm is very similar to KMeans. For each iteration, the algorithm is primarily calculating the likelihood value of each case on every dimension (input variables). Theoretically, the training performance should be linear with the number of dimensions. For 1 million cases, the training takes about 230 minutes with 50 input attributes. Again, this also means that the number of iterations required for the EM algorithm to converge is roughly independent of the number of attributes.

It takes a little more training time for continuous variables than for discrete variables. This is because the likelihood calculation for continuous variables is more complex than that of discrete variables.
Training Times vs. Sample Size
In this test, the number of cases is increased from 10,000; 25,000; 50,000; 75,000; 100,000 to 1 million. The other parameters remain fixed. A table with 20 attributes is used, with each attribute possessing 50 states, and the algorithm was requested to generate 10 clusters.
The following graph shows the training times for varying number of cases.
The data shows that:

The actual training time is linear to the sample size (it does not appear linear due to the scale of horizontal axes in the first graph). The following illustration shows this graph redrawn with a different scale.

It takes about 100 minutes training 1 million cases and 910 minutes for 10 million cases. The Microsoft Clustering algorithm training time is about eight times slower in comparison to the MDT algorithm training time.
Training Times vs. Number of States of Input Attribute
This experiment evaluates the impact of the number of states of each variable in relation to the training time. The number of states for each variable is varied from 10 to 100. The other dimensions remain at 1 million cases, 20 attributes, and 10 clusters.
The following graph shows the training times for the varying number of states.
The data shows that:

There is almost no impact on the algorithm performance in relation to the state number. The training time remains the same even as the state of input attributes changes. This is because the number of states does not affect the likelihood calculation complexity. Neither does this affect the number of iterations before convergence.

It takes a little more training time for continuous variables than discrete variables. For continuous variables, the calculation primarily uses the likelihood function. Discrete variable calculations are usually based on state counting. State counting takes less time than the complex math formula of the likelihood function.
Training Time with Nested Tables
The topics in this section describe the training performance of the Microsoft Clustering algorithm when nested tables are involved in the model. Two tables are used: the Customer case table and the Products nested table. This test is based on business Question 4 (Cluster customers based on their demographic information as well as the list of banking products they have purchased.).
Training Times vs. States in a Nested Key
In this test, there are 200,000 customers in the case table with five input attributes. Each attribute has 20 states. The number of nested table keys (Product) is increased from 100, 200, 500, to 1000. The average customer purchases is fixed at 25. When there are more states in the nested table key, the data becomes sparser.
The following graph shows the training times for varying number of states in nested table.
The data shows that:

The training time decreases when the number of states of the nested table increases. This result may be happening for two reasons. First, the attribute selection algorithm prevents the number of attributes from increasing beyond 255. Second, as the number of attributes increases, the data becomes sparser. Consequently, there are not enough patterns for the Cluster algorithm to determine the clusters to form, thus the algorithm uses fewer iterations for cluster tuning.

Case size (in terms of number of input attributes) is getting smaller due to feature selection. Some of the attributes are grouped together because the data is very sparse.
Training Times vs. Number of Products Purchased per Customer
In this experiment, the sample size is fixed to 200,000, and the number of keys (Products) in the nested table is fixed to 1000. The ratio of case key and nested key (for example, the average number of products each customer purchases) is increased from 10, 25, 50 to 100. The sample size of the nested table is increased accordingly.
The following graph shows the training times for the varying number of products purchased per customer.
The data shows that:

The training time is linear with the number of purchases per customer. It takes about 94 minutes to train when the number of products purchased per customer reaches 100 (about 20 million records in the nested table.) If you compare the result to the similar configuration using the Microsoft Decision Trees algorithm, you find the Clustering algorithm works faster with nested tables. The main reason is that MDT builds hundreds of decision trees when nested tables are used as a predictable variable. Clustering algorithms form the same number of clusters regardless of the involvement of a nested table.
Training Times vs. Cardinality of the Case Table
In this experiment, the ratio between the nested table key and the case table key is fixed at 50. The cardinality of the case table (Customers) is varied from 10,000; 50,000; 100,000 to 200,000. As the ratio between the case table and nested table is fixed, the cardinality of the nested table is changing accordingly.
The following graph shows the training times for a varying number of Master cases in nested table.
The data shows that:

The Clustering algorithm scales linearly with sample size.

When you compare the same configuration using the Decision Tree algorithm, the Microsoft Decision Trees algorithm is five times faster.
Conclusion
Data mining is quickly becoming a widely used analytical technique. This paper describes the two data mining algorithms included in SQL Server 2000 Analysis Services: Microsoft Decision Trees (MDT) and Microsoft Clustering. This paper also explains how to build mining models and to solve a few business problems. Performance results of training data mining models, using both algorithms under various dataset configurations, are shown. These results indicate that the Microsoft Data Mining Component is very fast and scalable. For example, it takes the Microsoft Decision Trees algorithm 100 minutes to train a mining model with 10 million cases and 25 attributes.
With SQL Server 2000 Analysis Services, data mining is no longer a reserved domain for statisticians. The complexity of the data mining algorithms is invisible to the user. Every database developer should be able to create and train data mining models and to embed these advanced features into their consumer applications.
Acknowledgments
We would like to acknowledge the efforts and talents of a few individuals who were instrumental in helping develop and enhance the content for this paper:

We are thankful to Dr. Raj Tewari, former Database Solutions Practice director at Unisys for his vision and inspiration to make this paper happen.

Thank you to David Heckerman, senior researcher at Microsoft Research for comprehensive and insightful reviews and valuable suggestions for this paper.

We are especially grateful to Wayne Kurtz, Database Solutions Practice director at Unisys for his thorough reviews and continued support for this paper.

Thank you to Len Wyatt, Program Manager for Data Warehouse and Analysis Practices at Microsoft for his excellent reviews and suggestions.

We are especially grateful to Diane Anderson, Manager for Marketing Communications (Global Industries) at Unisys. Without Diane's help and positive support, this paper and other Analysis Services papers from us would not have become a reality.

Finally, we want to thank the Analysis Services product team at Microsoft for their excellent feedback and reviews of this paper.