Data Mining Wizard (Analysis Services - Data Mining)

The Data Mining Wizard in Microsoft SQL Server Analysis Services starts every time that you add a new mining structure to a data mining project. The wizard helps you define new mining structures, and chooses the data sources that you will use for data mining. The wizard also can partition the data in the mining structure into training and testing sets, and help you add an initial mining model for each structure.

The content of a mining structure is derived from an existing data source view or cube. You can choose which columns to include in the mining structure. All models that are based on that structure can use those columns. You can enable users of a data mining model to drill down from the results of the mining model to see additional mining structure columns that were not included in the mining model itself.

You must make the following decisions when you create a data mining structure and model by using the Data Mining Wizard:

  • Whether to build the data mining structure and models from a relational database or from an existing cube in an OLAP database.

  • How much data to use in training, and how much to set aside for testing. When you partition a mining structure into training and testing data sets, all models that are based on that structure can use that testing set.

  • Which columns or attributes to use for prediction, and which columns or attributes to use as input for analysis. Each structure must also contain a key that uniquely identifies a case record.

  • Which algorithm to use. The algorithms provided in SQL Server Analysis Services have different characteristics and produce different results. You can create multiple models using different algorithms, or change parameters for the algorithms to create different models. .

The Data Mining Wizard provides functionality to help you make these decisions:

  • Wizard pages in which you define the case set. You can choose case tables and nested tables from a relational data source, or choose an OLAP data source and then select the case key and case level columns and then optionally set filters on the cube.

  • Dialog boxes that analyze the data in columns and recommend usage for the columns.

  • Auto-detection of column content and data types.

  • Automatic slicing of the cube, if your mining model is based on an OLAP data source.

After you complete the Data Mining Wizard, you use Data Mining Designer to modify the mining structure and models, to view the accuracy of the model, view characteristics of the structure and models, or make predictions by using the models.

For More Information:Data Mining Designer

Using the Data Mining Wizard

To start the Data Mining Wizard, add a new mining structure to an Analysis Services project by using Solution Explorer or the Project menu in Business Intelligence Development Studio.

The Data Mining Wizard has two branches, depending on whether your data source is relational or in a cube:

  • Relational Mining Models

  • OLAP Mining Models

Note

You do not need to have a cube or an OLAP database to do data mining. Unless your data is already stored in a cube, or you want to mine OLAP dimensions or the results of OLAP aggregations or calculations, we recommend that you use a relational table or data source for data mining.

Relational Mining Models

When you build a mining model from a relational data source in Analysis Services, you first specify in the Data Mining Wizard that you want to use an existing relational database to define the structure of the model. You also have the option of creating just the mining structure, or creating the mining structure and one associated data mining model. If you choose to create a mining model, you must specify the data mining technique to use, by selecting the algorithm that is most appropriate for the type of data mining analysis that you want.

For More Information:Data Mining Algorithms (Analysis Services - Data Mining)

Specifying the Data Source View and Table Types

The next steps in the wizard are to select the specific data source view that you want to use to define the mining structure, and to specify a case table. The case table will be used for training the data mining model, and optionally for testing it as well. You can also specify a nested table.

Selecting the case table is an important decision. The case table should contain the entities that you want to analyze: for example, customers and their demographic information. The nested table usually contains additional information about the entities in the case table, such as transactions conducted by the customer, or attributes that have a many-to-one relationship with the entity. For example, a nested tables joined to the Customers case table might include a list of products purchased by each customers, or a list of hobbies. For More Information:Nested Tables (Analysis Services - Data Mining)

Specifying the Column Usage

After you specify the case table and the nested tables, you determine the usage type for each column in the tables that you will include in the mining structure. If you do not specify a usage type for a column, the column will not be included in the mining structure.

Data mining columns can be one of four types: key, input, predictable, or a combination of input and predictable. Key columns contain a unique identifier for each row in a table. Some mining models, such as those based on the sequence clustering or time series algorithms, can contain multiple key columns. However, these multiple keys are not compound keys in the relational sense, but instead must be selected so as to provide support for time series and sequence clustering analysis. For more information, see Microsoft Time Series Algorithm or Microsoft Sequence Clustering Algorithm.

Input columns provide the information from which predictions are made. Predictable columns contain the information that you try to predict in the mining model.

For example, a series of tables may contain customer IDs, demographic information, and the amount of money each customer spends at a specific store. The customer ID uniquely identifies the customer and also relates the case table to the nested tables; therefore, you would use the customer ID as the key column. You could use a selection of columns from the demographic information as input columns, and the column that describes the amount of money each customer spends as a predictable column. You could then build a mining model that relates demographics to how much money a customer spends in a store. You could use this model as the basis for targeted marketing.

The Data Mining Wizard provides the Suggest feature, which is enabled when you select a predictable column. Datasets frequently contain more columns than you need to build a mining model. The Suggest feature calculates a numeric score, from 0 to 1, that describes the relationship between each column in the dataset and the predictable column. Based on this score, the feature suggests columns to use as input for the mining model. If you use the Suggest feature, you can use the suggested columns, modify the selections to fit your needs, or ignore the suggestions.

Specifying the Content and Data Types

After you select one or more predictable columns and input columns, you can specify the content and data types for each column.

For More Information:Data Types (Data Mining), Content Types (Data Mining)

Split Data into Training and Testing Sets

The final step before you complete the wizard is to partition your data into training and testing sets. The ability to hold out a portion of the data for testing is new in SQL Server 2008 and provides an easy-to-use mechanism for ensuring that a consistent set of test data is available for use with all mining models associated with the new mining structure.

You can specify that a certain percentage of the data be used for testing, and all remaining will be used for training. You can also specify the number of cases to use for testing. The definition of the partition is stored with the mining structure, so that whenever you create a new model based on the structure, the testing data set will be available for assessing the accuracy of the model.

For More Information:Validating Data Mining Models (Analysis Services - Data Mining), Partitioning Data into Training and Testing Sets (Analysis Services - Data Mining)

Completing the Wizard

The last step in the wizard is to name the mining structure and the associated mining model. If you select Allow drill through, the drill through functionality is enabled in the model. This lets users who have the appropriate permissions explore the source data that is used to build the model.

For More Information:Using Drillthrough on Mining Models and Mining Structures (Analysis Services - Data Mining)

Back to Top

OLAP Mining Models

When you build a multidimensional mining model from an OLAP data source in Analysis Services, you first specify in the Data Mining Wizard that you want to use an existing cube to define the structure of the model. You have the option of creating just the mining structure, or creating the mining structure plus one associated data mining model. If you choose to create a mining model, you must specify the data mining technique to use, by selecting the algorithm that is most appropriate for your business problem.

For More Information:Data Mining Algorithms (Analysis Services - Data Mining)

Specifying the Data Source and Case Key

Next, you select the cube dimension to use as the data source to define the mining structure. Then you select an attribute to use as the key, or case key, of the mining model.

Note

The OLAP mining model that you are building and the source cube that you use to create the model must be contained in the same Analysis Services database.

Specifying Case Level Columns and Column Usage

After you select a case key, the attributes and measures that are associated with that key are displayed in a tree view on the next page of the wizard. From this list, you select the attributes and measures to be used as the columns of the structure. These columns are known as case level columns. As with a relational model, you must also specify how each column should be used in the structure, which you can do on the next page of the wizard. Columns can be key, input, predictable, input and predictable, or unselected.

Adding Nested Tables

The OLAP branch of the Data Mining Wizard includes the option to add nested tables to the mining model structure. On the Specify Mining Model Column Usage page of the wizard, click Add Nested Tables to open a separate dialog box that guides you through the steps to add nested tables. Only the measure groups that apply to the dimension are displayed. Select a measure group that contains the foreign key of the case dimension. Next, specify the usage for each column in the measure group, either input or predictable. The wizard then adds the nested table to the case table. The default name for the nested table is the nested dimension name, but you can rename the nested table and its columns. For More Information:Nested Tables (Analysis Services - Data Mining)

Specifying the Content and Data Types

After you select one or more predictable columns and input columns, you can specify the content and data types for each column.

For More Information:Data Types (Data Mining), Content Types (Data Mining)

Slicing the Source Cube

In the OLAP branch of the wizard, you can limit the scope of your mining model by slicing the source cube before you train the mining model. Slicing the cube is similar to adding a WHERE clause to an SQL statement. For example, if a cube contains information about the purchase of products, you might limit an age attribute to more than 30, a gender column to only female, and a purchase date to no earlier than March 2000. In such a way you can limit the model to cover the scope of a female who is older than 30 years and who bought a product after March 2000.

Split Data into Training and Testing Sets

The final step before you complete the wizard is to partition the data that is available from the cube into training and testing sets. The definition of the partition is stored with the mining structure, so that whenever you create a new model based on the structure, the testing data set will be available for assessing the accuracy of the model.

For More Information:Validating Data Mining Models (Analysis Services - Data Mining), Partitioning Data into Training and Testing Sets (Analysis Services - Data Mining)

Completing the Wizard

The last step in the wizard is to name the mining structure and the associated mining model. If you select Allow drill through, the drill through functionality is enabled in the model. This lets users who have the appropriate permissions explore the source data that is used to build the model. You can also specify whether you want to add a new dimension to the source cube that is based on the mining model, or create a new cube from the mining model.

For More Information:Using Drillthrough on Mining Models and Mining Structures (Analysis Services - Data Mining)

Back to Top