Why Use Microsoft Data Mining?
SQL Server Community Article
Writer: Mark Tabladillo, Ph.D
Technical Reviewer: Jeannine Takaki, SQL Server UE
Published: August 2009
Applies to: SQL Server 2005, SQL Server 2008
Summary:This community article provides an introduction to data mining, and defines related terminology. The article provides examples from the daily news of questions that potentially can be answered by data mining, and broadly outlines the approach to creating and deploying data mining solutions in a business. Finally, the author explains the advantages of using Microsoft data mining technologies, and provides a list of resources for more information.
This community article discusses why people should use Microsoft data mining. The Microsoft data mining tools leverage the strengths of Microsoft SQL Server data management software and the Microsoft Office system. Data mining options in a Microsoft solution range from ad hoc analysis within Microsoft Office Excel, to collaborative solutions using Office SharePoint Server, customized Microsoft .NET connection software applications, and production-level stored procedures in SQL Server. However, many enterprise environments include software applications and data warehouse solutions from multiple vendors, and the Microsoft technology respects these designs by allowing for open data connectivity and access with other software products, including other data mining solutions.
Why Should I Be Interested in Data Mining?
Data mining is an analytical activity that requires people to be successful. Microsoft has defined three specific job roles for people who would be interested in data mining (Microsoft, 2008b). These roles are organized by function, and a specific person or team may span duties or interests in more than one group:
• Analysts – Subject matter experts who understand the data and are able to translate the results of data mining into actionable business information.
• Developers – Intend to integrate data mining within custom application solutions.
• Architects – Plan to meet the needs of analysts and developers, and may have a strong interest in data integration and reporting technologies.
You may take on one or more of these roles, even if you have another title like Chief Information Officer or Software Architect. Anyone who does more than basic data analysis can benefit from data mining.
To get you started with on data mining, this community article provides the following basic information:
• What is data mining?
• Is the correct term “Data Mining” or “Machine Learning”?
• What do we get from data mining?
• What are some problems that data mining solves?
• What are the costs for data mining?
• Are there advantages to using Microsoft data mining technologies?
What Is Data Mining
A new book on this topic was just released, Data Mining with Microsoft SQL Server 2008 (MacLennan, Tang, & Crivat, 2009). These authors have worked on development of the Microsoft data mining technology. Their definition of data mining is “the process of analyzing data to find hidden patterns using automatic methodologies” (MacLennan et al., 2009, p. 1).
First, data mining is an activity. When you use Microsoft technology to do data mining, you use SQL Server Analysis Services, which requires a SQL Server license. By moving the activity to a server, Microsoft makes data-intensive processes run in an environment designed for processing efficiency and connectivity to enterprise systems. Of course, SQL Server can run on a desktop, and I run it on the Windows Vista operating system for product demonstrations. However, data mining works best in a server environment. In other implementations you may know of, people have created data mining applications for either desktop or server application.
Second, data mining means analysis. The job role of “analyst” describes the person who typically is most interested in doing data mining. Microsoft has made a free plug-in available for Office Excel 2007 which allows an analyst to use Microsoft Office to connect to a SQL Server implementation of data mining. The Office Excel plug-in lets the analyst perform data mining on Office Excel data. The patterns in the data are analyzed on the server and the results are sent to Office Excel. Even without this plug-in, a trained analyst can use the free Business Intelligence Development Studio (BIDS) which ships with SQL Server to create data analysis solutions. The BIDS product is provided in both SQL Server 2005 or SQL Server 2008, and using the Microsoft Visual Studio development system framework, an analyst can create a project or solution (collection of projects), either of which can include Microsoft data mining. The main point is that data mining is another important tool which analysts can and should use.
Third, data mining reveals hidden patterns. You might think the adjective “hidden” to be counterintuitive, since you already have and see all the data. In fact, finding patterns is a normal human activity, and analysts can visually perform pattern matching and recognition on smaller datasets. However, an analyst might not discover all the patterns even in a small dataset. Data mining assumes that the data is empirical, and its results come solely from the information presented, not from any outside information or known patterns. Data mining in general (and Microsoft data mining in particular) creates conclusions based on accepted mathematical techniques for pattern matching, and outputs these as models, or patterns that could be considered as an explanation of an empirical dataset. Typically, the advanced mathematical techniques used in data mining come from academia or professional mathematicians, and data mining is an active area of university research. For our everyday use, data mining software makes it easy to apply sophisticated mathematics to discover hidden patterns.
Finally, data mining is an activity that can be automated or programmed. Software developers are an important audience for Microsoft data mining technology. Microsoft has not only provided user tools like BIDS and the Office Excel 2007 plug-in, but also provided the DMX (Data Mining Extensions) language and the .NET Framework for custom applications or solutions. Programmability extends the activity beyond just ad-hoc analysis, and allows analysts to leverage their findings into full enterprise-level solutions. Microsoft data mining is designed to be automated as needed, which greatly extends the potential for custom development and the possibilities for enterprise automation.
“Data Mining” or “Machine Learning”?
Many terms have been used in consulting and academia to describe data mining in general (and what Microsoft data mining in particular does). I too use alternative terms with regular frequency. I use the term predictive analytics to describe any statistical technique which helps provide a predictive look into the future. The Microsoft data mining tools can provide predictive analysis, but also could be used for descriptive (or historical) analysis too. Because the tools have dual use, the term predictive analysis does not sufficiently describe Microsoft’s use or implementation.
Another term I use is machine learning. Increasingly, technical universities offer courses in data mining or machine learning. Many times, the curriculum and topics are similar or identical. However, in use, most of the books I have on this topic prefer the term data mining (Berry & Linoff, 1997, 2000; Bigus, 1996; Cerrito, 2006; Han & Kamber, 2006; Langit, Goff, Mauri, Malik, & Welch, 2009; MacLennan et al., 2009; Tan, Steinbach, & Kuman, 2006; Tang & MacLennan, 2005; Witten & Frank, 2005). To illustrate the convergence of terminology, one book that I have is titled “The Elements of Statistical Learning” (Hastie, Tibshirani, & Friedman, 2001) but the subtitle for the book is “Data Mining, Inference, and Prediction”.
To me, the term mining has a connotation of being on a hunt and looking for something valuable. Even though diamonds and silver and coal are mined and are all valuable, the usual business metaphor is mining for gold. Also, the mining imagery involves a data miner, a person doing the activity.
By contrast, the term machine learning implies a machine doing automatic processing. The goal of machine learning is to achieve artificial intelligence, and it is important to advance academic theory and computer science to best leverage information technology. However, the term data mining suggests perhaps a more romantic notion of a data miner on a quest for discovery, needing the aid of the computer sidekick in pursuit of hidden patterns.
Another term used by the Association for Computing Machinery (Association for Computing Machinery, 2008) is knowledge discovery in databases (KDD). This association has been a respected force in the scientific computing area for decades, and attracts membership from leading academics and industry professionals. They use this term in the name of an annual conference, and the papers submitted to the conference cover data mining, predictive analytics, and machine learning.
In summary, both industry and academia have settled on the term data mining, but like me, you may find yourself using other terms depending on the situation.
What Do We Get from Data Mining?
I would love to say that we could solve all the world’s problems if we had enough data. If you have been talking with some data mining vendors or consultants, they might even imply that we could! Realistically, however, we all have to read beyond the marketing spin.
Back to the core definition: data mining reveals hidden patterns, and we get to see those patterns as models. It would be appropriate to say statistical models, because these models provide either descriptive or predictive statistical information. Statistics are important to analysts, but they are already important to anyone using business intelligence systems, even without data mining. Therefore, this article does not talk in detail about statistics. A good data mining text should include what statistical information you would need to know for data mining applications.
So what is a model? I have had fun asking this question to live audiences. The term has various connotations that include fashion models, model cars, or model homes. Generally a model is a description of something else, and in data mining, a model describes a hidden pattern in the data.
Microsoft data mining produces models. Some of the Microsoft models are expressed as equations, but more normally, the technology presents results showing relationships based on data. Data mining results are entirely dependent on the data that was used to create the model.
The following diagram comes from the SQL Server Books Online and illustrates the content of a model. What this diagram tells you is that a Microsoft mining model includes metadata (information about the data), patterns (which could be rules or formulas), and bindings (meaning data bindings to the mining structure, the place where the data is defined for data mining).
Figure 1. Mining Model Architecture (Microsoft, 2008a)
It is important to note that the result of data mining is not just an output equation. That difference was important to me coming from an applied statistical background, since in much of that work, the output is an equation with variables and weights. Data mining can and in some cases should present equations, but as a general rule, there might not be a specific equation or group of equations that completely describes an entire model.
In a many enterprise business environments, the goal of analysis is to solve some business objective. Even in places where statistics and statistical analysts have provided solutions, the output should provide actionable information. Decision makers want insight into how to make decisions. Whether these results come from statistics or data mining, what counts is translating mathematical results into business language. That translation requires that the person who does the analysis should know both the industry-specific objectives and the data mining technology. Data mining does not have life independent of a professional analyst who can interpret and apply the results toward a business question.
Thus, you cannot say that data mining cannot solve the world’s problems. Data mining doesn’t directly solve any problems—instead, it is the data miner, or the trained professional, who bridges the gap between a data mining model and a business problem.
Organizations that are serious about using data mining technology therefore should include ongoing training and development of their analysts, and make sure that their data miners learn about the technologies and tools for solutions, and not focus solely on the essential industry-specific objectives
If you use Microsoft data mining, as a professional data miner it is important that you invest in understanding what the possible models are and what they produce. Microsoft has provided free information online, and I have provided additional resources at the end of this paper. In addition to free information, larger companies should leverage data mining consultants and data mining training to accelerate either their entry into data mining or to or current use.
What Are Some Problems that Data Mining Solves?
Some typical problems are categorized in the book Data Mining with SQL Server 2008 (MacLennan et al., 2009, pp. 4-5):
• Recommendation generation – After a customer chooses one or more products, data mining suggests another product.
• Anomaly detection – Commonly, fraud detection in the financial industry means looking for that one transaction or one customer among thousands who might be committing fraud. Data mining can find a single observation among even the millions which might be different.
• Churn analysis – The term churn refers to losing a repeat customer or client, and knowing what early indicators might indicate someone is ready to switch can be important.
• Risk management – Credit ratings are often based on multivariate formulas which help predict levels of risk.
• Customer segmentation – Grouping customers or clients together, even by their own self-determined characteristics, can allow large organizations to manage marketing campaigns or even just organize their service professionals around similar groupings.
• Targeted ads – Marketers use data mining to deliver customized ads online, but organizations always want to know how to tailor any communications to be based on what they already know about their customers or clients.
• Forecasting – Time-series analysis takes data from the past, and provides a look into the future, even when there are seasonal increases or declines.
In past presentations, I have drawn many examples from the current news. I like the newspaper USA Today, and I have read from that paper to discover examples of questions that data mining can answer. With that experience in mind, I’m going to read today’s news from MSN.com (Microsoft, 2009b) and see where data mining may provide more insight:
U.S. news – One story reports that the U.S. economy shrank at the fastest clip since 1982. Data mining provides time-series tools which help provide comparable reporting even when seasonal trends might influence the data. In the United States, many retailers experience increased sales during the holiday season between November and January. Experienced retailers know to adjust their analysis to account for what is a different time compared to other times.
World news – One story tells us that record heat has been scorching southern Australia. More climate data can provide the associative variables to help understand why temperatures rise. The classic statistical wisdom is that correlation does not prove causality, meaning that just because two metrics might rise and fall together does not mean that one causes the other. Causality is a tricky subject (and beyond the scope of this community article). However, data mining might provide insight to what variables have strong associations with the temperature outcome.
Business news – One story says that consumers are more upbeat in January. Data mining can provide an important contribution to surveillance (meaning collecting and analyzing surveys). Perhaps you have been surveyed for some study, and we know of pollsters surveying voters just before and after elections. Most surveys are performed for a sample of the full population, and therefore analysis requires applying weights to the results to make the sample look more like the population. Data mining can provide insight into how to provide weights.
Sports news– One story mentions a specific American football team, and asks whether this team (the Pittsburgh Steelers) has the best defense line of all time. Sports are of topical interest to people around the world, and there are more soccer fans in the world than American football fans. People want to know which team or which players are the best, either in the current year, or compared historically. All we have from history are specific team performance metrics, and we can create a computer model (meaning applied data mining) to help us discover pattern insights into the sports teams and players considered the best in the world or best in history.
Health news– One story talks about salmonella. Recently in the United States, there was a salmonella outbreak in the peanut industry (believed to be a single processing plant). In any investigation (for health or criminal reasons) an investigator needs to sort through multiple events and factors and decide which ones have a likely causal relationship on a specific outcome. I cautioned earlier about drawing conclusions about causality when there is only associative evidence (guilt by association). In some situations, there might be multiple possible causes. Salmonella is a common bacterium, but there are many such kinds of bacteria, and every peanut processing facility needs to take multiple precautions against the various known causal agents. Sorting among the known candidates to determine the actual culprit requires data collection, and when the collected data involves complex multivariate relationships, data mining may provide insight for discovery.
Microsoft has some case studies online describing specific business implementations of SQL Server (Microsoft, 2009a), but the more important point is to learn about the algorithms and what they do. Once you have learned more about the technology, when faced with a question, you might find it easier to consider data mining as one piece of the technology that you would use in a solution. Therefore, the focus of this community article is not to talk in detail about any data mining algorithms, but instead point to general questions that data mining can help solve.
Not all applications have to do with making money or increasing profit (though these uses are common and valuable). Nonprofit organizations, including many hospitals, can and do use data mining to help discover hidden patterns as they promote their own charitable objectives. All organizations are increasingly creating or accessing larger and larger databases, and analysts looking at this data would be wise to learn more about what data mining can do for them. More details can be found in the data mining texts referenced at the end of this article.
What Are the Costs for Data Mining?
Several quick cost issues to consider:
• Licensing or purchase fees – Software costs vary, and though some software packages are free, others have licensing fees. SQL Server provides ways to monitor usage.
• Hardware and maintenance – Whether desktop or server, all hardware has associated maintenance costs, and backups for data are included in this cost.
• Data collection – Sometimes data are not available to solve a particular question, and other times, only part of the data is available. Purchasing data may be part of a solution, and other times, might require a new data collection strategy.
• Data preparation – Preparing data for mining means not only having it, but also cleaning and perhaps transforming the data for appropriate use. Someone needs to do that data preparation for analysis (and such tasks are common in most business intelligence systems).
• Personal development – Data mining does not work all by itself once you have loaded the software and data onto a computer. A trained data mining analyst commits to learning the technology and also what might be currently possible through custom development. Analysis teams should commit to learning more as new techniques and applications become available.
Larger organizations should intentionally include data mining costs into their normal budgeting process, where these costs can be properly adjusted and evaluated. The qualitative costs include the cultural shifts associated with new technology implementation.
Advantages of Microsoft Data Mining
There are many data mining software products available, free and otherwise. Some important advantages of the Microsoft data mining tools include:
• Tight implementation with a world-class database (SQL Server). Leveraging the performance, security, and optimization features of this database platform is important because Microsoft has created one of the world-leading databases. Many organizations already have data mining licensed through SQL Server 2005 or SQL Server 2008. This technology can access data in other formats through ADO.NET providers, including Oracle, Teradata, DB2, and SAS. Many desktop data mining products (alone) do not provide the scalability or security and access features in SQL Server.
• Applications for production-level business intelligence. A team of developers can integrate data mining into a current business intelligence solution. The integration works best with Windows, but because the interface can be Web-based, you need not be in a Windows-based BI solution to integrate these tools. Specifically, Office SharePoint Server provides a way to distribute and collaborate on data mining solutions through a Web browser.
• Production-quality use and output. Data modeling has the image of strong individuals leading the way. However, as we move forward, Microsoft technology makes it easier to support team ownership of data mining projects, and we can expect that future iterations will help enable teams to data mine together. Those newer cultural norms come from the SQL Server and business intelligence cultures, and I believe this change is a welcome addition to how statistical analysts have traditionally worked. We need our heroes, but we also need teams too.
• Extensibility. It is possible to extend Microsoft data mining to implement algorithms not in the product. For example, someone has written a Support Vector Machine (SVM) algorithm, and that code is on Codeplex (CodePlex Open Source Community, 2009). Also, using the .NET Framework, a software developer could integrate other standalone data mining solutions into the Microsoft solution (the integration depends on how open the other data mining solution is). Microsoft has provided some insider tips and techniques for extensibility in the Web site supported by the Microsoft data mining team (SQL Server Data Mining Team, 2009).
On their Web site, Microsoft provides much free information and tutorials on their software, as part of their commitment to customer, client, and partner service excellence. Many resources that I find useful for Microsoft data mining are included in the Resources section at the end of this paper.
This community article provides an introduction to data mining, describes what the term means, and takes a look at the models that data mining produces. The article also discussed some common business problems solved with data mining. Finally, the article focuses on Microsoft data mining, and outlines advantages for that implementation.
Data mining is an active research field, and you could spend years reading peer-reviewed articles and textbooks on different aspects of the topic. The field has been historically dominated by academic people, and there's much careful thought behind the not only the algorithms but the statistical philosophies of analysis and synthesis. Though I have provided data mining training, and teach at the university level, I consider myself a lifelong student of this topic. You might be or become an important part of that story. I encourage you to share what you know and learn.
Association for Computing Machinery. (2008). Welcome Page. Retrieved June 15, 2009, from http://www.acm.org/
Berry, M. J. A., & Linoff, G. (1997). Data Mining Techniques. New York, NY: John Wiley & Sons Inc.
Berry, M. J. A., & Linoff, G. (2000). Mastering Data Mining. New York, NY: John Wiley & Sons Inc.
Bigus, J. P. (1996). Data Mining with Neural Networks. New York, NY: McGraw-Hill Companies, Inc.
Cerrito, P. B. (2006). Introduction to Data Mining Using SAS Enterprise Miner. Cary, NC: SAS Institute Inc.
CodePlex Open Source Community. (2009). Support Vector Machine plug-in in Analysis Services. Retrieved June 15, 2009, from http://www.codeplex.com/svmplugin
Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques (Second ed.). San Francisco, CA: Elsevier, Inc.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer Science & Business Media, LLC.
Langit, L., Goff, K. S., Mauri, D., Malik, S., & Welch, J. (2009). Smart Business Intelligence Solutions with Microsoft® SQL Server® 2008. Redmond, WA: Microsoft Press.
MacLennan, J., Tang, Z., & Crivat, B. (2009). Data Mining with SQL Server 2008. Indianapolis, IN: Wiley Publishing Inc.
MarkTab Consulting. (2009). MarkTab.net Data Mining Portal. Retrieved June 15, 2009, from http://www.marktab.net
Microsoft. (2008a). Mining Models (Analysis Services - Data Mining). Retrieved June 15, 2009, from ms-help://MS.SQLCC.v10/MS.SQLSVR.v10.en/s10dm_0evalplan/html/cd4df273-0c6a-4b3e-9572-8a7e313111e8.htm
Microsoft. (2008b). SQL Server 2008 Books Online -- How Do I Browse by Job Role (Analysis Services - Data Mining). Retrieved August 15, 2008, from http://msdn.microsoft.com/en-us/library/bb510519.aspx
Microsoft. (2009a). Microsoft SQL Server 2008 Product Information - Case Studies. Retrieved June 15, 2009, from http://www.microsoft.com/sqlserver/2008/en/us/case-studies.aspx
Microsoft. (2009b). MSN Web Portal. Retrieved January 15, 2009, from http://www.msn.com
SQL Server Data Mining Team. (2009). SQLServerDataMining.com Home Page. Retrieved June 15, 2009, from http://www.sqlserverdatamining.com/ssdm/
Tan, P.-N., Steinbach, M., & Kuman, V. (2006). Introduction to Data Mining. Boston, MA: Pearson Education, Inc.
Tang, Z., & MacLennan, J. (2005). Data Mining with SQL Server 2005. Indianapolis, IN: Wiley Publishing, Inc.
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (Second ed.). San Francisco, CA: Elsevier, Inc.
About the author: Mark Tabladillo is a regular contributor to the MSDN data mining forums. He is owner of MarkTab Inc., has worked as a consultant for Solid Quality Mentors, and is an associate faculty member of the University of Phoenix.
For more information:
http://www.microsoft.com/sqlserver/ : SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/ : SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/ : SQL Server DevCenter
Did this paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to 5 (excellent), how would you rate this paper and why have you given it this rating? For example:
- Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
- Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of the papers that we release.