Time Series Predictions using Replacement Data (Intermediate Data Mining Tutorial)
Applies To: SQL Server 2016 Preview
In this task, you will build a new model based on worldwide sales data. Then, you will create a prediction query that applies the worldwide sales model to one of the individual regions.
Remember that your analysis of the results of the original mining model revealed big differences between regions and between product lines. For example, sales in North America were strong for the M200 model, while sales of the T1000 model did not do as well. However, the analysis is complicated by the fact that some series didn’t have much data, or data started at a different point in time. Some data was also missing.
To address some of the data quality issues, you decide to merge the data from sales around the world, and use that set of general sales trends to build a model that can be applied to predict future sales in any region.
When you create predictions, you will use the pattern that is generated by training on worldwide sales data, but you will replace the historical data points with the sales data for each individual region. That way, the shape of the trend is preserved but the predicted values are aligned with the historical sales figures for each region and model.
The process of using data from one series to predict trends in another series is called cross-prediction. You can use cross-prediction in many scenarios: for example, you might decide that television sales are a good predictor of overall economic activity, and apply a model trained on television sales to general economic data.
In SQL Server Data Mining, you perform cross-prediction by using the parameter REPLACE_MODEL_CASES within the arguments to the function, PredictTimeSeries (DMX).
In the next task, you will learn how to use REPLACE_MODEL_CASES. You will use the merged world sales data to build a model, and then create a prediction query that maps the general model to the replacement data.
It is assumed that you are familiar with how to build data mining models by now, and so the instructions for building the model has been simplified.
To build a mining structure and mining model using the aggregated data
In Solution Explorer, right-click Mining Structures, and then select New Mining Structure to start the Data Mining Wizard.
In the Data Mining Wizard, make the following selections:
Algorithm: Microsoft Time Series
Use the data source that you built earlier in this advanced lesson as the source for the model. See Advanced Time Series Predictions (Intermediate Data Mining Tutorial).
Data source view: AllRegions
Choose the following columns for the series key and time key:
Key time: ReportingDate
Choose the following columns for Input and Predict:
For Mining structure name, type: All Regions
For Mining model name, type: All Regions
Process the new structure and the new model.
To build the prediction query and map the replacement data
If the model is not already open, double-click the AllRegions structure, and in Data Mining Designer, click the Mining Model Prediction tab.
In the Mining Model pane, the model AllRegions should already be selected. If it is not selected, click Select Model, and then select the model, AllRegions.
In the Select Input Table(s) pane, click Select Case Table.
In the Select Table dialog box, change the data source to T1000 Pacific Region, and then click OK.
Right-click the join line between the mining model and the input data and select Modify Connections. Map the data in the data source view to the model as follows:
Verify that the ReportingDate column in the mining model is mapped to the ReportingDate column in the input data.
In the Modify Mapping dialog box, in the row for the model column AvgQty, click under Table Column and then select T1000 Pacific.Quantity. Click OK.
This step maps the column you created in the model for predicting average quantity to the actual data from the T1000 series for sales quantity.
Do not map the column Region in the model to any input column.
Because the model aggregated the data across all series, there is no match for the series values such as T1000 Pacific, and an error is raised when the prediction query runs.
Now you will build the prediction query.
First, add a column to the results that outputs the AllRegions label from the model together with the predictions. This way you know that the results were based on the general model.
In the grid, click the first empty row, under Source, and then select AllRegions mining model.
For Field, select Region.
For Alias, type Model Used.
Next, add another label to the results, so that you can see which series the prediction is for.
Click an empty row, and under Source, select Custom Expression.
In the Alias column, type ModelRegion.
In the Criteria/Argument column, type 'T1000 Pacific'.
Now you will set up the cross-prediction function.
Click an empty row, and under Source, select Prediction Function.
In the Field column, select PredictTimeSeries.
For Alias, type Predicted Values.
Drag the field AvgQty from the Mining Model pane into the Criteria/Argument column by using the drag and drop operation.
In the Criteria/Argument column, after the field name, type the following text: ,5, REPLACE_MODEL_CASES
The complete text of the Criteria/Argument text box should be as follows: [AllRegions].[AvgQty],5,REPLACE_MODEL_CASES
You might have noticed a problem with cross-prediction: namely, that to apply the general model to a different data series, such as the T1000 product model in the North America region, you must create a different query for each series, so that you can map the each set of inputs to the model.
However, rather than building the query in the designer, you can switch to DMX view and edit the DMX statement that you created. For example, the following DMX statement represents the query that you just built:
SELECT ([All Regions].[Region]) as [Model Used], ('T-1000 Pacific') as [ModelRegion], (PredictTimeSeries([All Regions].[Avg Qty],5, REPLACE_MODEL_CASES)) as [Predicted Quantity] FROM [All Regions] PREDICTION JOIN OPENQUERY([Adventure Works DW2003R2], 'SELECT [ReportingDate] FROM ( SELECT ReportingDate, ModelRegion, Quantity, Amount FROM dbo.vTimeSeries WHERE (ModelRegion = N''T1000 Pacific'') ) as [T1000 Pacific] ') AS t ON [All Regions].[Reporting Date] = t.[ReportingDate] AND [All Regions].[Avg Qty] = t.[Quantity]
To apply this to a different model, you simply edit the query statement to replace the filter condition and to update the labels associated with each result.
For example, if you change the filter conditions and column labels by replacing 'Pacific' with 'North America', you will get predictions for the T1000 product in North America, based on the patterns in the general model.