by Richard Lees. You can view Richard’s
Web site at:http://RichardLees.com.au.
Background of Data Mining
Data mining was once a very expensive technology that was
used by small elite teams in head offices of a few very large companies. Now
that data mining algorithms have become a commodity, anyone can employ data
mining algorithms at a very low cost. For example, data mining tools have been
included in SQL Server since 1998, and new tools, such as the Data Mining Add-Ins
for Excel, can be downloaded for free.
The purpose of this paper is to walk you through a complete,
real-life scenario for using data mining to predict customer profitability.
Imagine you are a commercial organization, continually receiving requests for
goods or services from new customers. You want to predict how much business a
new customer will bring you in the longer term. This might help you determine
which customers deserve special consideration. To do this you will use your
history of new customer orders and how the customers developed in the longer
term. In this walkthrough, you will use a classic data mining method to analyze
real-life data.
Sample Data for Data Mining
In data mining, you really need to have real data to
discover interesting relationships. So for this paper we will use some real
data that I have, although it is not customer data. The data is the HTTP web
logs from my Internet site. The structure of the Internet log data is similar
to what a commercial organization might have for new customer requests.
Customer requests would typically have attributes such as the customer name and
demographics, product and channel, whereas the http requests have such
attributes as the client, the location from which the request was issued, agent,
resource requested, etc. While a commercial organization might want to predict
profitability, we will predict http response times. Note that the purpose of
predicting web response times is only to serve as a good learning exercise and
demonstration of data mining.
You can see a live demonstration of the model we create in
this paper at the site, http://RichardLees.com.au.
Just click on the BI Demonstations/Data
Mining/Web Response Time links. The demonstration predicts web response
times for the last 50 hits to the site using a data mining model based on
decision trees with regressors. The model is reprocessed each weekend to keep
it current. The online report compares actual versus predicted response times
and plots them on a scatter graph.
Predicting Customer Profitability
Essentially our challenge is that we have a collection of
attributes, some discrete, some continuous, and we want to predict a continuous
variable (a number). A typical customer profitability model would use customer
demographics, channel, product requested, volume etc to predict future
profitability. Our model will use geographic location, client agent, resource
etc to predict response time.
There are thousands of ways that you could use a very similar
structure to predict a continuous number for a different real-world
application. These applications include predicting delivery time, project time,
project revenue, employee tenure, lease duration, residual value etc. Therefore,
once you have gone through this exercise, it should be very easy for you to
build your own customer profitability model or other similar model using your
organization’s real data.
We will build a data mining model using the following
attributes to predict response time
- City
- Country
- Client operating system
- Client agent (browser and version)
- HTTP status
- HTTP operation
- Referring server
- Target resource
- Target resource type (e.g., .htm, .jpg, .zip)
- Bytes out
In a customer profitability prediction model, the attributes
would be different, but in both cases they are a mix of discrete variables
(such as country) and continuous variables (such as bytes out and ordered
volume).
In any data mining exercise, one of the first tasks is to
identify the input variables and the output (predicted) variable(s). Note, in
some data mining tasks there is no desire to predict a variable, for example
some clustering exercises do not make predictions, they just cluster.
Designing the Data Mining Model in SQL Server 2008
You can reproduce this data mining model by downloading the DataMiningWebSample SQL Server database
from the download page on http://RichardLees.com.au. This download
file contains a SQL Server 2005/2008 backup file of the raw web logs, and a VS
solution that contains the completed data mining model. You shouldn’t need the
completed VS solution but it is provided in case you have trouble and want to
view a completed model.
Building the Data Mining Model
1. Start the SQL
Server Business Intelligence Development Studio from Start/Programs/SQL Server 2008
2. From the File
menu, click New/Project and select Analysis Services Project.
3. Add a Data Source
that points to the SQL Server database that you have downloaded and restored.
4. Add a Data Source
View, and add to it all the tables and views from the SQL database.
5. Right click Mining
Structures and then click New Mining
Structure.
6. Accept the default data mining technique of Decision Trees. Later you can add other
models, based on different algorithms, and apply them to the same data.
7. Select InternetLog
as the case table. We won’t be using
a nested table for this model. Don’t select any other tables; click next. The
key for InternetLog is the RID column.
8. Select all other columns as input columns, except ResponseTime and LogTime. We don’t need to select all columns for input, but because
at this stage we are not sure which ones are useful, we can include them all as
candidate columns.
9. Check the Predict
column for ResponseTime. The diagram
below show the dialog boxes provided by BIDS for choosing columns and setting
the content types for each column.
10. Click Next
and accept the default training size of 30 percent. If your model is taking too
long to process, or if you are using a server without much memory or a slow
CPU, you will want to change the model settings to reduce processing time. One
way to reduce processing time is to increase the holdback percent, meaning you
reduce the amount of data that is used for training. In a large data set, you
can increase the size of holdback up to as high as 90 percent. Another way to
speed up processing is to reduce the number of input variables. For example,
bytes-in and city are probably of little use so you can remove them from your
model.
11. Accept the default data mining structure name, but add DT to the end of the model name. This
is to help you remember that that this model is using a decision trees
algorithm.
12. Click Finish
and process the data mining model.
.jpg)
Figure 1
Viewing the Data Mining Model
The data mining model has been processed and we can now view
what the model has found.
Because our model is a decision tree structure, we can view
the model in its tree view. There is one tree for every discrete outcome.
Because this model has one continuous outcome (ResponseTime), there is only one tree. Note that your decision tree
may look a little different from this picture, since SQL Server has randomly
selected which records to hold back from training, which will be used to
validate the model.
.jpg)
Figure 2
Step 1. Click on the Mining
Model Viewer tab and select the nested Decision
Tree tab. You will see the first five levels of the decision tree
structure. Each node in the tree is represented in the viewer as a box with a
decision heading such as ResourceType=gif.
Also, if you focus on any one of the nodes, you can see the outcome at that
node. If you focus on one of the nodes, you can read information about this
node. For example, in the node labeled Resource=’/IndicatorImage.gif’,
I can read the following:
- Criteria for this node : Bytes Out < 1821932
and OK not = 'NOK' and Resourcetype not = '.gif' and Resource =
'/IndicatorImage.aspx'
- Support for this node: Existing Cases: 4122.
This tells you how many records in the training data satisfy the criteria
above.
- The prediction of our output variable:
ResponseTime.Notice how the prediction is not just a number.This is because we
had continuous input columns (BytesIn and BytesOut).Essentially the decision
tree is using these continuous input variables as regressors to assist with the
prediction.It is predicting the ResponseTime at this particular node to be
474.474+0.085*(Bytes In-1,017.390)-0.190*(Bytes Out-556.796).It is logical to
me that it would use BytesIn and BytesOut in this way.The larger the number of
bytes transferred, the longer response time, given other attributes are the
same. Note that the units of ResponseTime are milliseconds, the same as the
source data. ResponseTime = 474.474+0.085*(Bytes In-1,017.390)-0.190*(Bytes
Out-556.796).
Step 2. Now click on the Dependency Network nested tab. This tab provides an extremely
simplified view of the relative importance of the input variables in predicting
our output variable. That is, the graph indicates how useful each of the
variables is in predicting response time. I say it is extremely simplified
because the attributes do not work alone, but they work in combinations.
However, the diagram can be very useful: for example, I find this type of graph
particularly useful in a basket analysis model, where it will highlight the
dependencies between products.
.jpg)
Figure 3
Step 3. Click on Response
Time. All of the input variables will change color to indicate that they
help predicting response time.
Step 4. Now drag the slider bar on the left downward. Notice
how some of the input variables are shaded out. These are the less important
variables.
Step 5. Keep sliding the bar downward until you isolate the
most important input variable. In my model it happens to be ResourceType. That is logical. For
example, the response time will vary largely depending on whether the HTTP
resource is a .jpb, aspx, zip.
SQL Server makes it very easy to try other algorithms for
this model. I suggest that you try creating some additional models after you
have completed this paper. To add a new model to existing data, simply click on
the Create a related mining model
button in the Mining Models tab and
choose an appropriate algorithm. You won’t be able to use all the algorithms
for this particular model. For example, the time series algorithm is not an
appropriate choice, and some algorithms won’t like the continuous input and/or
output variables. If you get an error, you can choose to ignore continuous
input variables, or you might choose to discretize the continuous variables.
SQL Server can help you do this.
Determining Model Accuracy
Microsoft SQL Server 2008 comes with tools to help you
determine the accuracy of your model’s predictions. The diagram shows an
accuracy chart that I created for the model in the preceding analysis.
.jpg)
Figure 4
In this response time model, we are predicting a continuous
number. So how do we know if the model is good or bad, since it can’t really be
‘right’ or ‘wrong’ with a continuous number?
A good way to validate and compare models is to determine the
correlation coefficient between the actual response time and the predicted
response time. The higher the correlation coefficient, the stronger the
correlation, and therefore the better the model.
When testing the validity of your model it is important to
test the model with data that it has not used for training. This data is also
called ‘naive data’. When you create a data mining model, by default SQL Server
will hold back 30 percent of the data for testing, though you can change this
amount by setting the HoldOutMaxPercent
property. To generate an accuracy chart, SQL Server will use this ‘held back’
data by default, but you can also opt to use your own dataset.
For this walkthrough, the sample database contains a table
called InternetLogNew, which
contains new HTTP records that the model also has not used for training. InternetLogNew is a view over log
records that the model has not been trained on, and the view will randomly
select 50 records from this table. Therefore, each time you refresh the scatter
chart, a new 50 records will be selected, so the results might be slightly
different depending on which records are chosen.
1. Click on Mining
Accuracy Chart tab. In the Input
Selection nested tab you are able to select the test data you wish to use.
By default, it will use the data held back, but for this walkthrough, select
another table, InternetLogNew.
2. Click on the Lift
Chart and select Scatter Plot.
SQL Server will plot predicted values against actual values for your test data
against the model. If the data mining model was a perfect predictor, all the
dots will be along the diagonal. If the data mining model is no good as a
predictor, the dots will tend to be scattered randomly across the chart.
Generally you will find that the dots will be scattered, with a tendency to
group along the diagonal. However, do not expect to always see the data tightly
clustered along the diagonal. Data mining models can be helpful, even though
their correlation does not look very strong. Any correlation above 0 means that
the model can help you make predictions more accurately than random.
Querying the Data Mining Model
This is an area where I believe Microsoft SQL Server has
taken data mining to a new level. SQL Server makes it very easy to dynamically
query the data using numbers entered at the client, or data coming from a
relational database. The following procedure describes how to create queries to
get predictions or get more details about the patterns in the model.
1. Click on Mining
Model Prediction tab.
2. Click on Select
Case Table button and select InternetLogNew.
3. In the lower grid, click on the first row below Source.
In the drop down combo list select InternetLogNew
table. In the Field column leave it at RID.
4. Repeat the above step for every column in the InternetLogNew table. Although it can
be tedious, I suggest that you use the wizard this first time to build the
basic query. After you learn how to edit the data mining query, you can bypass
the UI and type the column names directly if you want.
5. Add one more row. This time, for source, select Prediction Function and in the Field column select Predict. There are actually two Predict
functions, one for scalar and one for tabular arguments, but it won’t matter
which one you choose.
6. Using your mouse, drag Response Time from the data mining model (not the SQL table) on the
top left of your screen to the Criteria/Argument
column. The wizard will fill out the field.
.jpg)
Figure 5
7. Click on the Switch
to query result view button at the top left of the screen. You will now see
the data from the new Internet records and the associated data mining
prediction. That’s how easy it is to query your data mining model with fresh
data. The actual query is not complex.
Click on the top left button and select query view. You will
see the DMX query that the wizard has generated for you. The DMX language has
been designed to be as close to the SQL language as possible. Of course the DMX
language uses many functions that aren’t available in SQL, but I think you will
find DMX a relatively easy language to learn. In my opinion it is much simpler
to learn than MDX, which is the OLAP cube query language.
Note that the OpenQuery
requests data from the relational database using the view, InternetLogNew. This view is a query that gets the latest Internet
log records (or in our demonstration randomly selects log records that are not
in the training data). The T-SQL statement for the view is provided here. You
can see that it will get the top 50 records and that it is ordered by a random
key, so that a different 50 records are retrieved on each request. In your
production implementation, you would not order by a random key, but rather
would select the last 50, or the top n records ordered by their predicted
profitability, etc.
CREATE view [dbo].[InternetLogNew] as
SELECT top 50
l.rid
,l.LogTime
,l.ResponseTime
,l.BytesIn
,l.BytesOut
,g.Country
,g.State
,g.City
,t.resource
,t.resourcetype
,ca.clientagnt
,ca.os
,case when l.ResponseTime>60000 and BytesOut=0 then 'NOK' else hc.OK end OK
,hc.http_status
,o.operation
,r.referringServer
,ch.ClientHost
FROM InternetLogTableNew l
inner join ClientHosts ch on ch.clienthostid=l.clienthostid
inner join IPCountry ipc on ipc.IPCountryID=ch.IPCountryID
inner join Geographies g on g.GeographyID=ipc.GeographyID
inner join Targets t on t.Targetid=l.targetid
inner join ClientAgents ca on ca.clientagentid=l.clientagentid
inner join HTTP_codes hc on hc.httpstatuscode=l.Status
inner join Operations o on o.operationid=l.operationid
inner join Referrers r on r.ReferrerID=l.ReferrerID
ORDER BY RAND(convert(float,bytesin)*DATEPART(ms,GETDATE())/1000.0)
+RAND(convert(float,bytesout)*DATEPART(ms,GETDATE())/1000.0)
This particular query makes predictions for testing;
however, another way customers are using data mining models is to make
predictions in bulk. The predictions are then stored back in the data warehouse
or in an OLAP cube for aggregated analysis. For example, you might predict
summarized revenue by region, salesperson, or product group etc.
Creating Reports
Sometimes data mining models are created only to perform an
analysis and are then discarded. However, increasingly organizations are
building data mining models, like the one above, that become part of the
everyday applications available to a wide number of staff and/or customers.
Because SQL Server has generated the query for you, it is a very easy task to
put the query into a Reporting Services report, and you have a data mining
prediction report. Use of data mining reports is transforming the way that data
mining is employed. It is taking data mining from the exclusive preserve of
elite head office teams to the whole organization.
For example, I created the following report based on the
query above, and then created a scatter chart by using the Reporting Services
charting component.
.jpg)
Figure 6
To create a Reporting Services report using the data mining
model to make predictions from your model, follow these instructions. This will
add a Reporting Services project to your data mining solution. I like the way
BIDS (Business Intelligence Development Studio) can keep these projects in the
same solution. It is not uncommon to have a BIDS solution with three projects:
• SSIS (Integration Services for data
extraction and load)
• SSAS (Analysis Services for cube and/or data
mining models)
• SSRS (Reporting Services).
1. In your Visual Studio project, right click on the
solution and click Add New Project.
Select Report Server Project,
provide a name and click OK.
2. Right click on Shared
Data Sources, and set up a new data source that points to the Analysis
Services where your data mining model is deployed.
3. Right click on the Reports
folder and select Add New Report.
4. Click Next.
5. Click Query
Builder.
6. Click the right most button Design Mode.
7. Now paste the data mining query, the wizard created for
you at step 8 above, into the query pane.
8. Click OK.
9. Click Next in
the Design the Query pane.
10. Select Tabular
as the Report Type.
11. Add all the Available
Fields to the Details section of
the report by highlighting the fields and clicking on the Details button.
12. Select your favorite Table Style, or leave to the
default and click Next. You can
change the format of the report later. Let’s just get the query and result set.
13. Enter a name for your report, such as Response Time Prediction.
14. Click Finish.
15. You are now able to preview your report in the Preview pane. You can change the
formatting, edit the DMX query, add a scatter Plot chart etc to your report and
deploy to your reporting services server.
16. Notice how fast the data mining report is to make 50
predictions. Generally, you will find that data mining predictions are very
fast, it is just the model processing that can take some time.
Deploying and Updating the Data Mining Solution
When you create your production data mining model, you may
want to schedule a regular update of the model. This is necessary because the
factors that predict a profitable customer will change over time, and you want
the data mining model to adapt to those changes. So, for example, you might
have the mining model reprocess at the end of each week or month. During
reprocessing, the original data mining model remains available for online
querying.
Conclusion
In this short time we have created a data mining model,
validated the model, and used the model to create a user-friendly report. When
you embed data mining in a report, it also becomes accessible to non-technical
or customer-facing employees who know little about data mining but can benefit
from the predictions.
This white paper was intended to help you get started with
data mining. Data mining is a wide and exciting area of business intelligence,
and many organizations have yet to appreciate how they can profit from its use.
I anticipate the application of data mining to grow exponentially over the next
few years. By and large, the application of data mining is constrained only by
our imagination.
For more information on Data Mining
• http://RichardLees.com.au (real time data
mining demonstrations)
• http://RichardLees.blogspot.com (one
of many data mining bloggers)
• http://msdn.microsoft.com/en-us/sqlserver/cc511476.aspx
(Microsoft Technical Resources for Data Mining)
• http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470277742.html
(good text about SQL Server Data Mining)
• http://www.microsoft.com/sqlserver/2008/en/us/data-mining.aspx
(Microsoft data mining marketing page)
• http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/threads
(SQL Server data mining forum)
• http://www.sqlserverdatamining.com
(SQL Server Data Mining dev team’s site with tips/articles/whitepapers etc)
About the author. Richard Lees is a business intelligence developer and consultant. See his site for
live demos based on SQL Server Data Mining.