Inventory Predictive Modeling via Microsoft SQL Server 2005 Analysis Services
SQL Server Technical Article
Published: September 2005
Applies To: SQL Server 2005
Summary: This paper describes an approach for building retail outofstock predictive models using SQL Server 2005 Analysis Services. When applied to Project REAL data, these models produced very accurate predictions. The retailer whose data was used in Project REAL could potentially increase sales by millions of dollars a year if these predictive models were deployed in the company.
On This Page
Background
About Project REAL
Data Warehouse Description
Data Mining Methodology and Modeling Dataset Construction
Predictive Modeling Results
Automation of Inventory Predictions
Conclusion
Appendix A: Store Clustering Attributes
Appendix B: OutofStock DMX Query
Background
National and international retailers offering a large number of products to customers face the common challenge of ensuring adequate product stock levels on hand across hundreds or thousands of stores. The problem of determining adequate stock levels balances the following competing costs.

The cost of storing high levels of stock. These costs refer to the price that is paid by the retailer for securing physical space, for extra supplier purchases, and for the distribution that is associated with maintaining high levels of product stock at all retail stores.

The cost of missed sales. These costs are incurred when a customer comes into a store and wants to purchase a given product, but is unable to because the product is out of stock.
There have typically been two options for the retailer faced with this dilemma. The retailer can either overstock and incur a highstock cost, or else keep the stock cost low and run the risk of missing sales opportunities because a product is unavailable at the time of the desired purchase. The optimum way to combat these competing costs is to utilize predictive modeling to ensure that each store in the chain has the correct inventory levels.
Retailers have historically relied on a combination of supply chain software, internal analysts, and intuition to forecast inventory needs. With increasing margin pressure, many retailers from the Chief Financial Officer to the inventory manager have focused on finding more accurate methods for forecasting inventory across their chain. Predictive analytics is the solution. It offers the ability to accurately forecast the right products to the right store locations.
This paper describes the use of Analysis Services in Microsoft® SQL Server™ 2005, coupled with a SQL Server data warehouse to employ data mining technology to provide accurate and uptodate information for product stocking decisions. The methodology presented here is designed to provide outofstock predictions at the store/product level. For a given product, SQL Server 2005 Analysis Services is used to build a data mining model that makes outofstock predictions for each store in the chain. This approach enables the retailer to effectively balance the competing costs associated with stocking product inventory.
About Project REAL
Project REAL is an effort to discover best practices for creating business intelligence (BI) applications that are based on SQL Server 2005. In Project REAL we are doing this by creating reference implementations based on actual customer scenarios. This means that customer data is brought inhouse and is used to work through the same issues that customers face during deployment. These issues include:

The design of schemas—both relational schemas and those used in Analysis Services.

The implementation of data extraction, transformation, and loading (ETL) processes.

The design and deployment of client frontend systems, both for reporting and for interactive analysis.

The sizing of systems for production.

The management and maintenance of the systems on an ongoing basis, including incremental updates to the data.
By working with real deployment scenarios, we gain a complete understanding of how to implement a BI system using SQL Server BI tools. Our goal is to attempt to address the full gamut of concerns that a company wishing to analyze potentially large data sets would face during their own realworld deployment.
This paper summarizes the data mining work that has been performed for Project REAL to date. A number of other papers describe the work that has been done and lessons learned in other areas. To find the latest information, visit the Project REAL Web site (http://www.microsoft.com/sql/bi/ProjectREAL/).
Data Warehouse Description
In Project REAL, a data warehouse was constructed to summarize sales data for millions of products for a retailer with several hundred stores nationwide. The relevant datasets used for the outofstock predictive modeling are:

Sales fact data that is aggregated at the store, product (item), day level. Specifically, daily sales are stored for each product that is sold, for each store in the retailer’s chain.

Inventory fact data that is aggregated at the store, product (item), day level. Specifically, this is the number of days that the product has been in stock, for each product, for each day, for each store in the retailer’s chain.

Product (item) information consisting of product name, description, retail price, and product category hierarchy.

Store information consisting of store description, store classification (for example, an indicator that specifies whether the store is large or small), store division, store region, store district, city, zip code, state, linear feet of shelf space, and other store information.

Date information (a date dimension) that maps factlevel date identifiers to appropriate fiscal weeks, months, quarters, years, and other date information.
Having a clean, uptodate data warehouse available provides a solid basis for all business intelligence applications to utilize this valuable information asset. In this particular outofstock modeling task, the data warehouse enabled easy construction of the modeling dataset.
Data Mining Methodology and Modeling Dataset Construction
Based upon the experience that we gained in applying data mining technology to various retailsales forecasting and outofstock modeling problems during Project REAL, we developed a twophase modeling process that increases the likelihood of accurate predictions.
Phase I of the modeling process consists of clustering stores in the retail chain based upon aggregate sales patterns. After quality storecluster models have been constructed, these clusters are used to more accurately make outofstock predictions at the store/product level during Phase II of the modeling process. Both phases are efficiently and effectively addressed by using data mining technology in SQL Server 2005 Analysis Services.
This section provides details on the overall outofstock prediction process starting with a description of the process that was used to create the modeling datasets. This is followed by a discussion of methodologies for evaluating data mining models that are constructed by using SQL Server 2005 Analysis Services.
Outofstock predictive modeling process
The outofstock modeling problem is approached in two phases.
Phase I consists of grouping together those stores that have similar aggregate sales patterns across the chain. The process of grouping together stores that have similar aggregate sales patterns is referred to as store clustering. Store clustering is accomplished by using the Microsoft Clustering algorithm that ships in SQL Server 2005 Analysis Services to group together stores having similar aggregate sales patterns. When applied to datasets consisting of aggregate sales patterns, the Microsoft Clustering algorithm attempts to group together stores into clusters in such a way that stores belonging to the same cluster are more similar than stores belonging to different clusters. The modeling dataset is based on aggregate sales data that is derived from the data warehouse. Hence, the “similarity” measure that is used to group together stores is computed over this aggregate sales data.
We then used the cluster models that were produced in Phase I to build more accurate outofstock predictive models in Phase II. This allows the predictive algorithms (such as Microsoft Decision Trees or Microsoft Neural Networks) to make use of the cluster results to improve predictive accuracy. Essentially, to optimize the predictions for a given product p in a given store s, the predictive algorithms in SQL Server 2005 may make use of the fact that the sales for the same product p in a similar store s’ may improve predictive accuracy when determining whether or not p will be out of stock in store s.
Highlevel steps for constructing outofstock predictive models for product p
The twophase process for building optimal predictive models by using SQL Server 2005 Analysis Services consists of the following highlevel steps. Details on these steps are provided in the next sections.

Use the product hierarchy in the product information (dimension) portion of the data warehouse to determine the product category c(p) for product p. We assume that products within the same category have similar aggregate sales patterns across the chain of stores. Hence the product hierarchy is used to identify the set of similar products c(p) for a given product p. Alternatively, a product clustering approach could be used to determine a datadriven grouping of products similar to p by clustering products based upon their sales across the chain of stores.

Prepare modeling dataset Dcluster for store clustering to capture storelevel properties and sales for category c(p) as determined in step 1.

Apply the Microsoft Clustering algorithm to the dataset Dcluster to obtain k clusters (groups) of those stores that are similar across storelevel properties and sales for category c(p).

For each cluster l = 1,…,k obtained in step 3:

Let S(l) be the set of stores that belong to cluster l. Recall that these stores have similar categorylevel aggregate sales, for the category c(p).

Create a dataset DOOS(p,S(l)) consisting of historic and current weekly sales aggregates, and changes in weekly sales aggregates, for each store s in S(l). In addition, include Boolean flags indicating whether or not product p was in stock or out of stock one week into the future and two weeks into the future.

Apply the predictive modeling algorithms in SQL Server 2005 Analysis Services (such as Microsoft Decision Trees or Microsoft Neural Networks) to the dataset DOOS(p,S(l)). Use the historic and current weekly sales aggregates as input attributes and the one and twoweek outofstock Boolean flags as output or predictonly attributes. This instructs SQL Server 2005 Analysis Services to generate a model that takes as its input the historic and current weekly sales, along with changes in weekly sales, and then make a prediction of the Boolean flags that indicate whether or not product p will be out of stock one and two weeks into the future.

The data preparation and modeling steps are described in more detail in the next two sections.
In the context of Project REAL, the retailer is Barnes & Noble. Project REAL (an acronym for Reference implementation, Endtoend, At scale, and Lots of users) is a cooperative effort between Microsoft Corporation, Apollo Data Technologies, and other elite technology partners including UNISYS, EMC2, ProClarity, Panorama, Scalability Experts, and Intellinet with the goal of creating a referenceable implementation of a BI system.
Using authentic enterprise data provided by Barnes & Noble, Project REAL collaborators were able to discover the best practices for creating BI applications that are based on Microsoft SQL Server 2005. This complete system works through the full gamut of customer operational challenges by analyzing large data sets in a comprehensive way.
The focus was on the following five products (books), all of which are within the same category (Chapter Books).

Captain Underpants & The Invasion of the Incredibly Naughty Cafeteria Ladies from Outer Space (Captain Underpants Series)

Junie B Jones Is a Graduation Girl

Dinosaurs: A Nonfiction Companion to Dinosaurs Before Dark (Magic Tree House Research Guide Series #1)

City in the Clouds (Secrets of Droon Series #4)

Twisters and Other Terrible Storms (Magic Tree House Research Guide Series)
Phase I: Store clustering
Recall that the goal of store clustering is to obtain groups of stores that have similar sales patterns, focused on sales over the products in the category to which product p belongs c(p).
Phase I begins with constructing the dataset that will be used for store clustering. To minimize the computational impact on the live retail sales and inventory data warehouse, we recommend creating a separate SQL database to store the datasets that are used for modeling with SQL Server 2005 Analysis Services.
Store clustering dataset construction
The dataset used for store clustering consisted of storelevel aggregate sales over the time period January 2004–December 2004. The dataset consists of a single table with the key StoreID. StoreID is an integer that uniquely identifies each store in the chain.
Since the goal of the store clustering task is to group together stores based upon similarity of aggregate sales patterns, we worked with the retailer to identify a set of aggregate sales attributes that would be useful for this exercise. The type and information content of the set of attributes that are used for modeling typically affects the output models that are produced. When crafting the set of attributes to be used for modeling, we find it beneficial to work with stakeholders who have a solid grasp of the underlying business processes. In addition, based upon work that we have done in the retail vertical, we can recommend attributes that may prove to be useful. For each store, attributes were aggregated over the fact data in the data warehouse. These saleslevel aggregates are as follows. For a detailed description of all storelevel attributes used for the store clustering problem, see Appendix A.

Categoryspecific derived attributes for the category to which product (book) p belongs (referred to as c(p) in the previous section). These are:

Category Average Weekly Modeled: Estimate of the number of books that were expected to be sold from the category, per week in the given store.

Category Average Weekly On Hand: Average weekly onhand (instock) values for the category in the given store.

Category Average Weekly On Order: Average number of books onorder per week, for the category in the given store.

Category Fraction Holiday Sales: Fraction of total holiday sales coming from books in category c(p) in the given store. Note that holiday sales were those occurring between November 15, 2004 and the end of December 2004.

Category Fraction Sales: Fraction of total nonholiday sales coming from books in category c(p) in the given store. Note that nonholiday sales were sales occurring between January 1, 2004 and November 14, 2004.

Category Holiday Discount Amount: Total amount that books in category c(p) were discounted during the holiday period, in the given store.

Category Holiday Markdown Amount: Total amount that books in category c(p) were marked down during the holiday period, in the given store.

Category Holiday Member Discount Amount: Total member discount given for books in category c(p) during the holiday period, in the given store.

Category Holiday Sales Amount: Total sales for books in category c(p) during the holiday period, in the given store.

Category Holiday Sales Quantity: Total quantity of books in category c(p) that were sold during the holiday period, in the given store.

Category Total Discount Amount: Total amount that books in category c(p) were discounted during the nonholiday period, in the given store.

Category Total Markdown Amount: Total amount that books in category c(p) were marked down during the nonholiday period, in the given store.

Category Total Member Discount Amount: Total member discount given for books in category c(p) during the nonholiday period, in the given store.

Category Total Sales Amount: Total sales for books in category c(p) during the nonholiday period, in the given store.

Category Total Sales Quantity: Total quantity of books in category c(p) that were sold during the nonholiday period, in the given store.


For each of the following categories, the fraction of total holiday sales coming from the category was computed (for example, Cat Frac Holiday Sales in Appendix A). Additionally, the fraction of total nonholiday sales coming from the category was computed (for example, the Cat Frac Sales attribute in Appendix A). Based upon feedback from the retailer, the categories considered to capture highlevel overall sales were: Beginning Reader, BG Bestseller, BGCKBKS Under 15, BG Reference, Blank Books, Board Block Touch, Chapter Books, Christian Insp, Cooking, Current Affairs, Family Child Care, Fantasy, Fiction, Fiction PB Young Readers, Hist Biog, Humor, Juv Activity, Juv Christmas, Juv Series HC, Juv Series PB, Juv Work Books, Literature, Magazines, Management, Manga Japanese, Mystery, New Age, Newspapers, Pict Sty Bks, Pop Rock, Romance, Science Fiction, Self Improvement, Single Cards, Spinner, Techno Thriller Espionage, and Teen Fiction.

The following storelevel aggregates were also included.

Total Holiday Sales: Total sales of all books at the store during the holiday period.

Total Sales: Total sales of all books at the store during the nonholiday period.

Total Weekly Average Modeled: Average weekly estimate of total number of books sold at store.

Total Weekly Average On Hand: Average weekly total of books onhand at the given store.

Total Weekly Average On Order: Average weekly total of books onorder at the given store.


The following store properties were also included in the store clustering dataset: City, Linear Ft (the number of linear feet of shelf space in the store), Square Feet (of the store), and State.
These storelevel properties and aggregate values were computed via SQL and stored in a single, denormalized table. Note that this table is used only for modeling via the data mining components in SQL Server 2005. If an organization wanted to update the storeclustering models on an ongoing basis, we would recommend automating the construction of this denormalized table as a data preparation step. Alternatively, a view could be defined instead of the table, thereby creating a denormalized result set from normalized fact and dimension data.
Each row of the table is indexed by the unique integer StoreID and contains a column for each property/aggregate listed in the previous list and described in detail in Appendix A. For both the store clustering exercise and outofstock modeling, stores were considered that had been open for at least one year. This resulted in 794 stores for this particular retailer. Hence the single SQL Server 2005 relational table that was used for store clustering consisted of 846 rows and 100 columns (1 column for the StoreID and 99 columns for storing the attribute values as defined in the previous list of attribute values).
Storeclustering mining model construction
After the source relational table is constructed, we proceed to the step of building the storeclustering mining model via Microsoft Visual Studio® 2005. This begins with the creation of an Analysis Services project within Visual Studio 2005, and the creation of a Data Source object that connects to the SQL Server instance containing the store clustering dataset. A data source view also must be created. This data source view selects only the single table that contains the storelevel properties and aggregate attributes. See Figure 1.
After the data source view is added, a new mining structure and mining model for the store clustering exercise is created. The mining structure defines the column structure that will be used to construct the storeclustering model. All attributes are selected as Input attributes except the Cat Fraction Sales and Cat Total Sales Qty attributes. These are selected as Predict (Input and Predictable). See Figure 2.
The CLUSTER_COUNT parameter associated with the Microsoft Clustering algorithm specifies the maximum number of clusters to search for in the source data. By default, the value is 10. With the goal of producing distinct clusters that sufficiently capture the correlations in store properties and aggregate sales/inventory values, the value of CLUSTER_COUNT was changed to 5 based upon experience and an assessment of the quality of the cluster model(s) that are found with 5 clusters. Typically, analysts will need to alter the CLUSTER_COUNT parameter to obtain the results desired. In this application, we found that distinct store clusters (with respect to similarity in aggregate sales) were found when we used CLUSTER_COUNT = 5. In addition, evidence suggested that better quality cluster models are obtained in this application when using MINIMUM_SUPPORT = 50. This instructs the Microsoft Clustering algorithm to identify only those clusters that have 50 or more cases (stores in this application) in them. Similarly, MINIMUM_SUPPORT would also be altered by an analyst to obtain the desired cluster quality. See Figure 3.
After setting the parameters for the Microsoft Clustering algorithm, the mining structure is processed, thereby creating and populating the mining model object in SQL Server 2005 Analysis Services.
Storeclustering model evaluation
After the storeclustering models have been constructed, they are evaluated by using the Microsoft SQL Server 2005 Analysis Server Cluster Browser to determine if indeed the clusters are distinguished by category sales patterns.
For a summary of the store clusters found by SQL Server 2005 Analysis Services, see Figure 4. The store clusters tend to be discriminated by the Total Sales, Category Sales Quantity, Category Weekly Sales, Category Weekly OnHand, and OnOrder values. Figure 4 shows, for each cluster, the example city/state values for the stores belonging to a given cluster (left column in Figure 4) and the discriminating factors for each cluster (right column in Figure 4).
Note that discriminating features (attribute/value) pairs can be determined by using the Discrimination tab in the SQL Server 2005 Analysis Services Cluster Model Viewer (see Figure 5).
Phase II: Outofstock predictive modeling
Now that the storecluster model(s) that group together stores having similar category sales patterns have been constructed, we focus on the problem of predicting whether or not a given book will be out of stock one week into the future and two weeks into the future. Prior to building the mining models to make the outofstock predictions, we construct modeling datasets for each product (book) of concern.
Outofstock predictive modeling dataset construction
The dataset used for the outofstock predictive model task takes into account weekly sales data for a given book across all stores in the retail chain. Based upon experience and the amount of historical data available, we developed a sliding window strategy to create the dataset used for predictive modeling. The sliding window strategy typically is a good data preparation strategy when the data has a temporal nature (for example, when predictions are made into the future) and the type of the predictable quantity is discrete (such as Boolean outofstock indicators or sales bins). If there is sufficient temporal data and the predictable quantity is inherently numeric, timeseries modeling may be a preferred strategy.
By exploring the data and focusing on the context of the problem, we made the following observation. At a weekly time period, historical sales and inventory data for a given book and given store produced 52 records (one year’s worth of data). Typically, there are very few outofstock events that occur for a single store and single product. To obtain accurate predictive models, the training data needs to include a sufficient number of outofstock events and instock events to identify trends differentiating the two. The following data preparation strategy was aimed at achieving a sufficient number of outofstock events and instock events by considering a given product p over the entire chain of stores. We included the store cluster label (derived from the storecluster model) to allow the predictive modeling algorithms to identify trends in outofstock behavior that might be different between different store clusters.
Outofstock predictive modeling dataset construction for a given product (book) p
For each store s in the retail chain:
For each week between January 1, 2004 – December 31, 2004:

A unique store/week identifier is generated. This will be the key for the outofstock modeling dataset.

The store cluster label that store s belongs to, as determined by the storecluster model(s) described in Phase I: Store clustering.

The sales of product (book) p in store s in the current week (CurrentWeekSales).

The onhand quantity of the product (book) p in store s in the current week (CurrentWeekOnHand).

The onorder quantity of the product (book) p in store s in the current week (CurrentWeekOnOrder).

The sales of product (book) p in store s one week ahead (into the future) of the current week (OneWeekAheadSales).

The sales of product (book) p in store s two weeks ahead (into the future) of the current week (TwoWeeksAheadSales).

The sales of product (book) p in store s one week behind (in the past) of the current week (OneWeekBackSales).

The sales of product (book) p in store s two weeks behind (in the past) of the current week (TwoWeeksBackSales).

The sales of product (book) p in store s three weeks behind (in the past) of the current week (ThreeWeeksBackSales).

The sales of product (book) p in store s four weeks behind (in the past) of the current week (FourWeeksBackSales).

The sales of product (book) p in store s five weeks behind (in the past) of the current week (FiveWeeksBackSales).

The onhand quantity of product (book) p in store s one week ahead (in the future) of the current week (OneWeekAheadOnHand).

The onhand quantity of product (book) p in store s two weeks ahead (in the future) of the current week (TwoWeekAheadOnHand).

The onhand quantity of the product (book) p in store s one week behind (in the past) of the current week (OneWeekBackOnHand).

The onhand quantity of the product (book) p in store s two weeks behind (in the past) of the current week (TwoWeeksBackOnHand).

The onhand quantity of the product (book) p in store s three weeks behind (in the past) of the current week (ThreeWeeksBackOnHand).

The onhand quantity of the product (book) p in store s four weeks behind (in the past) of the current week (FourWeeksBackOnHand).

The onhand quantity of the product (book) p in store s five weeks behind (in the past) of the current week (FiveWeeksBackOnHand).

The onorder quantity of the product (book) p in store s one week behind (in the past) of the current week (OneWeekBackOnOrder).

The onorder quantity of the product (book) p in store s two weeks behind (in the past) of the current week (TwoWeeksBackOnOrder).

The onorder quantity of the product (book) p in store s three weeks behind (in the past) of the current week (ThreeWeeksBackOnOrder).

The onorder quantity of the product (book) p in store s four weeks behind (in the past) of the current week (FourWeeksBackOnOrder).

The onorder quantity of the product (book) p in store s five weeks behind (in the past) of the current week (FiveWeeksBackOnOrder).

The sales bin for sales of product (book) p in store s one week ahead (into the future) (OneWeekSalesBin). The sales bins used for this exercise are:

None

1 to 2

3 or more


The sales bin for sales of product (book) p in store s two weeks ahead (into the future) (TwoWeekSalesBin). The sales bins used for this exercise are:

None

1 to 2

3 or more


A Boolean flag indicating whether or not the product (book) p will be out of stock one week ahead (into the future) or not (OneWeekOOSBoolean).

A Boolean flag indicating whether or not the product (book) p will be out of stock two weeks ahead (into the future) or not (TwoWeekOOSBoolean).

The change in sales of product (book) p in store s between the current week and one week behind (in the past) (FirstWeekSalesChange).

The change in sales of product (book) p in store s between one week behind the current week and two weeks behind (in the past) (SecondWeekSalesChange).

The change in sales of product (book) p in store s between two weeks behind the current week and three weeks behind (in the past) (ThirdWeekSalesChange).

The change in sales of product (book) p in store s between three weeks behind the current week and four weeks behind (in the past) (FourthWeekSalesChange).

The change in sales of product (book) p in store s between four weeks behind the current week and five weeks behind (in the past) (FifthWeekSalesChange).
This is a lengthy list of attributes. In typical data mining situations, it is common to have hundreds, and even thousands or more attributes describing the entity to be modeled. The data mining algorithms will attempt to identify the pertinent correlations for making accurate predictions. Since the pertinent correlations are not known apriori, it is common to include all possible attributes in the training dataset. For example, in predicting the Boolean flag indicating twoweek out of stock, the following attributes were useful (see Figure 13).

Current Week On Hand

Four Weeks Back On Hand

One Week Back Sales

Current Week Sales

Cluster Label (from the storeclustering model)

Four Weeks Back Sales

Five Weeks Back On Hand

Two Weeks Back Sales
But, prior to modeling, it was impossible to know that these attributes would be relevant.
Note that there is a row in this training table for each store/week pair. Also note that the attributes FirstWeekSalesChange, SecondWeekSalesChange, …, FifthWeekSalesChange help to approximate the first derivative (change) in sales week over week. Typically, these types of attributes can be very useful in improving a model’s predictive accuracy.
To more objectively evaluate the predictive accuracy of the models built using SQL Server 2005 Analysis Services, it is common practice to hold out a subset of data and call this the testing set. The remainder of the dataset is called the training dataset. The data mining models are constructed using the training dataset. Predictions from the model are then compared with the actual values over the testing set.
For the data for the training dataset in this application, we used all records in the dataset corresponding to the weeks between January 1, 2004 and November 30, 2004. For the testing set, we used all records in the dataset corresponding to the weeks between December 1, 2004 and December 31, 2004.
The training dataset corresponding to a given product (book) p consisted of 10,635 rows. (Note that there is a row for each store/week pair for the weeks between January 1, 2004 and November 30, 2004.) In addition to the store/week identifier (key) column, the training dataset consisted of 38 columns. Other columns (such as Store_ID and Week_ID) were also included in the dataset, but not used for modeling.
The testing dataset corresponding to a given product (book) p consisted of 2,442 rows. (Note that there is a row in for each store/week pair for weeks between December 1, 2004 and December 31, 2004.) There were also 38 columns in addition to the unique store/week identifier (key) in the testing dataset.
Outofstock mining model construction
After the source relational tables were constructed, the predictive data mining models were built via Visual Studio 2005. To accomplish this, first an Analysis Service project and a data source are created that specify the SQL Server instance that stores the training and test tables for the products (books) under consideration. A data source view that selects the tables of interest is created. See Figure 6.
After the data source view is added, a new mining structure is created for the outofstock predictive modeling exercise. The current weekly sales, historic weekly sales, onhand, and onorder attributes are specified as inputs. The Boolean outofstock flags and the weekly salesbin attributes (OneWeekOOSBoolean, TwoWeekOOSBoolean, OneWeekSalesBin, TwoWeekSalesBin) are specified as predictable (Predict Only) attributes.
Microsoft Decision Trees and Microsoft Neural Network models are built to determine which algorithm produces the most accurate models (as measured by comparing predictions with actual values over the testing set). After an initial mining structure and mining model is built (specifying the input and predictable attributes), the analyst can easily add other mining models. (You can try different algorithms by using the Add Mining Model feature.) Note that in Figure 7, Input indicates that the attribute value will be used as an input into the predictive model. PredictOnly indicates that these values should be predicted by the data mining model. Key indicates the column that uniquely identifies the case of interest. The analyst can also set an attribute to be of type Predict. This attribute type indicates that the attribute is used both as an input and as a predictable attribute. (It is considered to be an Input attribute when it is used to predict the value of another attribute.)
We found that highly accurate predictive models were obtained by using the Microsoft Decision Trees algorithm when we altered the default values for COMPLEXITY_PENALTY and MINIMUM_SUPPORT. We used COMPLEXITY_PENALTY = 0.10. This may result in a larger (more detailed) decision tree. A larger, more detailed decision tree may more accurately model the training data, even to the extent of modeling “noise” in the training data. This situation is called overfitting and typically results in poorer predictions on a holdout or testing set of data. In addition, a larger, more detailed decision tree may require slightly more computation to make predictions. The value of MINIMUM_SUPPORT was also lowered to 5–which may also result in a larger decision tree. Figure 9 shows these changes.
Predictive Modeling Results
Empirical results
As previously mentioned, the predictive accuracy of mining models were evaluated by examining them over the testing set (data corresponding to the weeks between December 1, 2004 and December 31, 2004).
We evaluated the mining model by using the lift chart functionality in SQL Server 2005 Analysis Services. A lift chart provides an overall picture of the predictive performance of a given data mining model, over a specified dataset. In this case, we used the test dataset for this evaluation. The lift chart compares the predictive performance of the mining model with an ideal model and a random model. Figure 9 shows the lift chart for Boolean twoweek outofstock predictions for the book Captain Underpants. The task is to predict a true/false value as to whether the book will be in stock or out of stock two weeks into the future at any store in the retail chain. Note that the overall predictive accuracy of this model is close to the ideal model.
Figure 10 summarizes the predictive performance for store/week combinations when the book was actually out of stock two weeks into the future.
Figure 11 shows the graphical depiction of the Microsoft Decision Trees algorithm predicting twoweek outofstock Boolean values.
Figure 12 summarizes the predictive accuracies for all five products (books) that were considered in this task. On average, the data mining models obtained by SQL Server 2005 Analysis Services can predict whether or not a book will be out of stock one week into the future with 98.52% accuracy. Predictions on whether or not the book will be out of stock two weeks into the future are, on average, 86.45% accurate. Predictive accuracies go up when predicting the actual salesbin value.
Sales opportunity
Conservative estimates based on the predictive models indicate that lost sales opportunities of between $3,405.48 and $6,810.95 resulted for one and two book copies respectively. These sales opportunities would not have been lost had the models been deployed at the beginning of the year. See Figure 13.
The method for calculating the lost sales opportunity for each item was computed by multiplying the number of outofstock (OOS) total store weeks by the twoweek Boolean predicted value. This generated the new OOS store weeks occurrences had the models been deployed. Multiplying the OOS predicted values by the percentage of actual book sales for the year by the respective retail sale price generates the total sales opportunity.
Sales Opportunity Formula
(# of total OOS weeks for all stores) x (2week Boolean predicted accuracy)X (% of actual sales across all stores) x (retail price)= Yearly increase in sales opportunity using Apollo OOS predictions
If you extrapolate these figures to the total retail line, the sales opportunity becomes extremely attractive. There exists the potential to increase sales by millions of dollars if inventory predictive models are implemented.
Automation of Inventory Predictions
Data mining functionality has been included in SQL Server 2005 Integration Services. Hence, Integration Services can be used to automate the process of making weekly or monthly outofstock predictions. These predictions provide the retailer with updated reports on store/product combinations that may be likely to experience an outofstock situation.
Apollo Data Technologies recommends automating not only the process of obtaining outofstock predictions on a regular basis, but also that of automating the process of evaluating the performance of the predictive models. The latter task can then be used to determine if the predictive accuracy of the trained data mining models has fallen below an acceptable level. If the trained outofstock predictive models fall below a prescribed predictive accuracy, it is likely that the trends and patterns extracted by the SQL Server 2005 data mining models have changed. In this case, new models will need to be constructed and finetuned.
Automation of outofstock predictions
The outofstock predictive models allow the retailer to have more intelligence about possible outofstock situations one and two weeks into the future. The common framework for deploying the outofstock predictive technology requires updating the outofstock predictions on a weekly or monthly recurring basis.
The process of producing the product/store combinations that may likely experience an outofstock situation can be implemented by using SQL Server 2005 Integration Services. This can be done by implementing and scheduling the following workflow.
Producing likely outofstock product/store combinations
For each product p of interest:

Determine the category c(p) to which product p belongs by querying the Product dimension information in the data warehouse.

Determine the name of the storeclustering model that groups together retail stores by the sales of category c(p). This can be done by using a query to a lookup table.

If product p has not been considered before, create a new predictive modeling dataset for product p as described in Outofstock predictive modeling dataset construction. If product p has been considered before, append rows to the predictive modeling dataset for product p. The new rows correspond to new store/week combinations.

Determine the name of the SQL Server 2005 Analysis Services outofstock predictive models for product p. This can be done via a query to a lookup table.

Using a SQL Server Integration Services Prediction Join task, execute the prediction join to obtain and write the outofstock predictions to either a relational table or an OLAP cube. For details on the DMX prediction join used to obtain outofstock predictions, see Appendix B.
Predictive accuracy automation
It is important to measure the performance of the outofstock predictive models over time to ensure that the information that is used to support retail stocking decisions is as uptodate and accurate as possible. To do this, we would implement a system that correlates actual stocking levels with those that have been predicted by the models. If the predicted values tend to be in agreement with actual levels, a level of confidence in the predictive models is obtained.
Again, the process of measuring the agreement between predicted sales levels and actual values can be done using SQL Server 2005 Integration Services. This is accomplished by comparing the predicted salesbin values for product/store/week combinations (which are written to a table when obtaining the outofstock predictions as discussed in Automation of outofstock predictions) with the actual sales levels when they are reported and updated in the data warehouse.
By aggregating the number of times that the predictions agree with actual sales values (salesbins), the retailer gets a feel for the accuracy of the predictive models. If the accuracy falls below a given threshold, the outofstock predictive models and possibly the storeclustering models need to be rebuilt and retuned. The cause of the incorrect predictions may be related to changing recent sales trends that have not been captured by models that were built in the recent past.
Conclusion
This paper describes an approach for building retail outofstock predictive models using SQL Server 2005 Analysis Services. When applied to the Project REAL data, the models produce very accurate predictions for the subset of items under consideration. Had the models been deployed for one full year, conservative estimates suggest that this retailer would have made thousands of dollars by circumventing lost sales opportunities. If you extrapolate these figures to the total retail product line, the sales opportunity becomes extremely attractive with the potential to increase sales by millions of dollars. This paper also provides recommendations for automating and implementing the model prediction and accuracy measurements into a lineofbusiness application using SQL Server 2005 Integration Services.
For more information:
Did this paper help you? Please give us your feedback. On a scale of 1 (poor) to 5 (excellent), how would you rate this paper?
Appendix A: Store Clustering Attributes
This table lists the storelevel attributes that were derived from the Project REAL data warehouse.
Table 1: Store Clustering Attributes
Attribute Name 

Beg Reader PB Cat Frac Holiday Sales 
Beg Reader PB Cat Frac Sales 
BG Bestseller510 Cat Frac Holiday Sales 
BG Bestseller510 Cat Frac Sales 
BGCKBKS Under15 Cat Frac Holiday Sales 
BGCKBKS Under15 Cat Frac Sales 
BG Reference Cat Frac Holiday Sales 
BG Reference Cat Frac Sales 
Blank Books Cat Frac Holiday Sales 
Blank Books Cat Frac Sales 
Board Block Touch Cat Frac Holiday Sales 
Board Block Touch Cat Frac Sales 
Cat Avg Weekly Modeled 
Cat Avg Weekly On Hand 
Cat Avg Weekly On Order 
Cat Fraction Holiday Sales 
Cat Fraction Sales 
Cat Holiday Disc Amt 
Cat Holiday Markdown Amt 
Cat Holiday Member Disc Amt 
Cat Holiday Sales Amt 
Cat Holiday Sales Qty 
Cat Total Disc Amt 
Cat Total Markdown Amt 
Cat Total Member Disc Amt 
Cat Total Sales Amt 
Cat Total Sales Qty 
Chapter Books Cat Frac Holiday Sales 
Chapter Books Cat Frac Sales 
Christian Insp Cat Frac Holiday Sales 
Christian Insp Cat Frac Sales 
City 
Cooking Cat Frac Holiday Sales 
Cooking Cat Frac Sales 
Current Affairs Cat Frac Holiday Sales 
Current Affairs Cat Frac Sales 
Family Child Care Cat Frac Holiday Sales 
Family Child Care Cat Frac Sales 
Fantasy Cat Frac Holiday Sales 
Fantasy Cat Frac Sales 
Fiction Cat Frac Holiday Sales 
Fiction Cat Frac Sales 
Fiction Literary Cat Frac Holiday Sales 
Fiction Literary Cat Frac Sales 
Fiction PB Young Readers Cat Frac Holiday Sales 
Fiction PB Young Readers Cat Frac Sales 
Hist Biog Cat Frac Holiday Sales 
Hist Biog Cat Frac Sales 
Humor Cat Frac Holiday Sales 
Humor Cat Frac Sales 
Juv Activity Cat Frac Holiday Sales 
Juv Activity Cat Frac Sales 
Juv Christmas Cat Frac Holiday Sales 
Juv Christmas Cat Frac Sales 
Juv Series HC Cat Frac Holiday Sales 
Juv Series HC Cat Frac Sales 
Juv Series PB Cat Frac Holiday Sales 
Juv Series PB Cat Frac Sales 
Juv Work Books Cat Frac Holiday Sales 
Juv Work Books Cat Frac Sales 
Linear Ft 
Literature Cat Frac Holiday Sales 
Literature Cat Frac Sales 
Magazines Cat Frac Holiday Sales 
Magazines Cat Frac Sales 
Management Cat Frac Holiday Sales 
Management Cat Frac Sales 
Manga Japanese Comics Cat Frac Holiday Sales 
Manga Japanese Comics Cat Frac Sales 
Mystery Cat Frac Sales 
New Age Cat Frac Holiday Sales 
New Age Cat Frac Sales 
Newspapers Cat Frac Holiday Sales 
Newspapers Cat Frac Sales 
Pict Sty Bks HC Cat Frac Holiday Sales 
Pict Sty Bks HC Cat Frac Sales 
Pict Sty Bks PB Cat Frac Holiday Sales 
Pict Sty Bks PB Cat Frac Sales 
Pop Rock Cat Frac Holiday Sales 
Pop Rock Cat Frac Sales 
Romance Cat Frac Holiday Sales 
Romance Cat Frac Sales 
Science Fiction Cat Frac Holiday Sales 
Science Fiction Cat Frac Sales 
Self Improvement Cat Frac Holiday Sales 
Self Improvement Cat Frac Sales 
Single Cards Cat Frac Holiday Sales 
Single Cards Cat Frac Sales 
Spinner Cat Frac Holiday Sales 
Spinner Cat Frac Sales 
Square Feet 
State 
Techno Thriller Espionage Cat Frac Holiday Sales 
Techno Thriller Espionage Cat Frac Sales 
Teen Fiction Cat Frac Holiday Sales 
Teen Fiction Cat Frac Sales 
Total Holiday Sales 
Total Sales 
Total Weekly Avg Modeled 
Total Weekly Avg On Hand 
Total Weekly Avg On Order 
Appendix B: OutofStock DMX Query
Following is the DMX query that was used to obtain outofstock predictions.
SELECT t.[Unique_Store_Week_ID], [CBCaptainUnderpantsDT].[One Week OOS Boolean], PredictProbability([CBCaptainUnderpantsDT].[One Week OOS Boolean]), [CBCaptainUnderpantsDT].[One Week Sales Bin], PredictProbability([CBCaptainUnderpantsDT].[One Week Sales Bin]), [CBCaptainUnderpantsDT].[Two Week OOS Boolean], PredictProbability([CBCaptainUnderpantsDT].[Two Week OOS Boolean]), [CBCaptainUnderpantsDT].[Two Week Sales Bin], PredictProbability([CBCaptainUnderpantsDT].[Two Week Sales Bin]) From [CBCaptainUnderpantsDT] PREDICTION JOIN OPENQUERY([ApolloDWSDM], 'SELECT [Unique_Store_Week_ID], [ChapterBooksCluster], [CurrentWeekSales], [CurrentWeekOnHand], [CurrentWeekOnOrder], [OneWeekBackSales], [TwoWeeksBackSales], [ThreeWeeksBackSales], [FourWeeksBackSales], [FiveWeeksBackSales], [OneWeekBackOnHand], [TwoWeeksBackOnHand], [ThreeWeeksBackOnHand], [FourWeeksBackOnHand], [FiveWeeksBackOnHand], [OneWeekBackOnOrder], [TwoWeeksBackOnOrder], [ThreeWeeksBackOnOrder], [FourWeeksBackOnOrder], [FiveWeeksBackOnOrder], [OneWeekSalesBin], [OneWeekOOSBoolean], [TwoWeekSalesBin], [TwoWeekOOSBoolean], [FirstWeekSalesChange], [SecondWeekSalesChange], [ThirdWeekSalesChange], [FourthWeekSalesChange], [FifthWeekSalesChange] FROM [dbo].[CB_CaptainUnderpants_Testing2] ') AS t ON [CBCaptainUnderpantsDT].[Chapter Books Cluster] = t.[ChapterBooksCluster] AND [CBCaptainUnderpantsDT].[Current Week Sales] = t.[CurrentWeekSales] AND [CBCaptainUnderpantsDT].[Current Week On Hand] = t.[CurrentWeekOnHand] AND [CBCaptainUnderpantsDT].[Current Week On Order] = t.[CurrentWeekOnOrder] AND [CBCaptainUnderpantsDT].[One Week Back Sales] = t.[OneWeekBackSales] AND [CBCaptainUnderpantsDT].[Two Weeks Back Sales] = t.[TwoWeeksBackSales] AND [CBCaptainUnderpantsDT].[Three Weeks Back Sales] = t.[ThreeWeeksBackSales] AND [CBCaptainUnderpantsDT].[Four Weeks Back Sales] = t.[FourWeeksBackSales] AND [CBCaptainUnderpantsDT].[Five Weeks Back Sales] = t.[FiveWeeksBackSales] AND [CBCaptainUnderpantsDT].[One Week Back On Hand] = t.[OneWeekBackOnHand] AND [CBCaptainUnderpantsDT].[Two Weeks Back On Hand] = t.[TwoWeeksBackOnHand] AND [CBCaptainUnderpantsDT].[Three Weeks Back On Hand] = t.[ThreeWeeksBackOnHand] AND [CBCaptainUnderpantsDT].[Four Weeks Back On Hand] = t.[FourWeeksBackOnHand] AND [CBCaptainUnderpantsDT].[Five Weeks Back On Hand] = t.[FiveWeeksBackOnHand] AND [CBCaptainUnderpantsDT].[One Week Back On Order] = t.[OneWeekBackOnOrder] AND [CBCaptainUnderpantsDT].[Two Weeks Back On Order] = t.[TwoWeeksBackOnOrder] AND [CBCaptainUnderpantsDT].[Three Weeks Back On Order] = t.[ThreeWeeksBackOnOrder] AND [CBCaptainUnderpantsDT].[Four Weeks Back On Order] = t.[FourWeeksBackOnOrder] AND [CBCaptainUnderpantsDT].[Five Weeks Back On Order] = t.[FiveWeeksBackOnOrder] AND [CBCaptainUnderpantsDT].[One Week Sales Bin] = t.[OneWeekSalesBin] AND [CBCaptainUnderpantsDT].[One Week OOS Boolean] = t.[OneWeekOOSBoolean] AND [CBCaptainUnderpantsDT].[Two Week Sales Bin] = t.[TwoWeekSalesBin] AND [CBCaptainUnderpantsDT].[Two Week OOS Boolean] = t.[TwoWeekOOSBoolean] AND [CBCaptainUnderpantsDT].[First Week Sales Change] = t.[FirstWeekSalesChange] AND [CBCaptainUnderpantsDT].[Second Week Sales Change] = t.[SecondWeekSalesChange] AND [CBCaptainUnderpantsDT].[Third Week Sales Change] = t.[ThirdWeekSalesChange] AND [CBCaptainUnderpantsDT].[Fourth Week Sales Change] = t.[FourthWeekSalesChange] AND [CBCaptainUnderpantsDT].[Fifth Week Sales Change] = t.[FifthWeekSalesChange]
Download
Inventory Predictive Modeling via Microsoft SQL Server 2005 Analysis Services
672 kb
Microsoft Word file