Export (0) Print
Expand All
Automate Testing of Your Stored Procs
Streamline Your Database Setup Process with a Custom Installer
Stop SQL Injection Attacks Before They Stop You
Get a Lean, Mean Dev Machine with the Express Editions of Visual Basic and SQL Server 2005
Keep Bad Guys at Bay with the Advanced Security Features in SQL Server 2005
AMO Lets You Dig Deeper into Your Data from Your Own Applications
Make Sense of Your Web Feedback using SQL Server 2005
Fuzzy Lookups and Groupings Provide Powerful Data Cleansing Capabilities
Unearth the New Data Mining Features of Analysis Services 2005
Using XQuery, New Large DataTypes, and More
Expand Minimize
73 out of 92 rated this helpful - Rate this topic

Introduction to SQL Server 2005 Data Mining

SQL Server 2005
 

Craig Utley

June 2005

Applies to:
   Microsoft SQL Server 2005
   Data mining

Summary: Get an overview of typical data mining problems and the tools and models that are available in SQL Server 2005 for solving these problems. (13 printed pages)

Contents

Introduction
The Promise of Data Mining
SQL Server 2005 Data Mining
Business Problems that can be Solved with Data Mining
Conclusion

Introduction

Businesses face a number of problems when attempting to analyze their data. There is generally no lack of data. In fact, many businesses feel they are drowning in data; they are unable to make sense of it all and turn it into information. To this end, data warehousing was developed to allow organizations to take a wealth of data from disparate data sources and turn it into actionable information.

Data warehousing properly implemented is an incredibly powerful solution. A company can analyze information and use it to make informed decisions. You can use data warehousing to answer questions such as:

  • What products are most popular among 15-20 year old females?
  • How do order lead times and on-time delivery percentages for a particular customer compare to the average for all customers?
  • What are the costs and hours spent per patient day by ward?
  • What percentage of projects is stuck in the contracting phase for more than ten days?
  • If a particular facility is spending more money on a particular class of medicines, do patient lab results show a healthier population than those at other facilities?

Beyond these questions, which are generally answered using an analytics application, data warehousing supports a variety of data delivery formats. Analytics applications are designed for analysts who need to slice and dice data and examine lower levels of detail. Reporting applications create standard paper reports or online reports that are either static or offer limited drill-down capabilities for users needing less powerful features. For business decision makers, scorecards are a powerful tool for providing a quick glance at the company's key performance indicators and letting decision makers know where they stand at any point in time.

As useful and powerful as data warehousing is, it has a limitation: it is historical in nature. Data warehouses represent a snapshot in time as they are often loaded and processed on a nightly or weekly basis. While real-time or near real-time data warehouses are possible and will become more common, the data will still represent present and historical data at best. While business decision makers and analysts need to examine this data, data warehouses require a certain amount of analysis in order to dig for cause-and-effect relationships.

Instead of requiring someone to develop a hypothesis and then go digging for the data, wouldn't it be helpful for the data warehouse to determine relationships, predict future events, spot bad data, and allow for the analysis of data in ways that have never been possible? This can be accomplished through the use of data mining. You can use data mining to answer questions such as:

  • What will customers buy? What products sell together?
  • How can a company predict which customers are at risk for churning?
  • Where has the marketplace been, and where is it going?
  • How can a business best analyze its Web site usage patterns?
  • How can an organization determine the success of a marketing campaign?
  • How can businesses prevent poor quality data from entering the system?
  • What are the best techniques for analyzing unstructured data, such as free-form text?

Microsoft has introduced a wealth of new data mining features in Microsoft SQL Server 2005 that allow businesses to answer these and other questions. This paper examines the problems that data mining can address and how SQL Server 2005 addresses those problems. It includes an overview of typical data mining problems and the tools and models available with SQL Server 2005 to solve these problems.

The Promise of Data Mining

Data mining promises to make life easier for business decision makers and analysts. Beyond just predicting future performance, data mining helps identify relationships in the data that might otherwise be hidden, calculating the odds that certain products might sell or certain customers might go elsewhere based on various circumstances, and more.

Microsoft data mining tools go far beyond traditional data mining. While data mining examined the data and looked for previously unknown relationships between elements, it was typically a "run and done" process. In other words, companies ran a data mining tool and then looked at the results. Far from being a tool that does a specific job, Microsoft data mining tools perform their work but immediately apply the results back into the process. This means that data mining models may group customers based on certain criteria, but the analyst can then immediately start slicing data based on these new customer groupings.

This ability to feed the results of the data mining models right back into the analysis process means that organizations can now determine how best to analyze the data. Today, most organizations break customers down by geography and include limited demographic information, such as income or education levels. Data mining might determine that a particular mix of customers makes more sense than any geographic boundaries; for example high-income people who enjoy both classic music and tractor pulls might prefer particular products, regardless of their education levels or geography. Once this customer group has been established, it can be used during the analysis phase to examine data. Analyzing data using groupings that are not inherently obvious is one of the great benefits of data mining.

Being able to analyze data using the output of data mining models is powerful, but these results can also be used during the integration phase. In other words, as an organization consolidates data from throughout the enterprise, data mining can determine if the data makes sense in light of other data. This ability to spot anomalous data during the data integration phase of a business intelligence project means that the resulting data warehouse is cleaner and analysis against it will be more accurate and meaningful.

Taking data mining one step further, data collection applications can apply the rules used in order to screen out bad data input at its source. This means that as data is entered, it can be examined against the universe of existing data and its confidence can be determined. Data that falls below a certain confidence level can be rejected. Thus, if a Web form accepts user information and a user enters a birth date of 2/7/1197, the data can be rejected at its source. Beyond just simple dates, however, the input could look for a combination of birth date, education level, occupation, and income in order to determine a confidence level and accept or reject the data.

Thus, far from being a simple process that is run once and then analyzed, data mining becomes an integral part of the entire development lifecycle of a business intelligence (BI) application, data integration, data analysis, and reporting.

SQL Server 2005 Data Mining

Introduction

The Microsoft SQL Server 2005 Data Mining Platform introduces significant capabilities to address data mining in both traditional and new ways. In traditional terms, data mining can predict future results based on input, or attempt to find relationships among data or cluster data in previously unrecognized yet similar groups.

Microsoft data mining tools are different from traditional data mining applications in significant ways. First, they support the entire development lifecycle of data in the organization, which Microsoft refers to as Integrate, Analyze, and Report. This ability frees the data mining results from the hands of a select few analysts and opens those results up to the entire organization. Second, SQL Server 2005 Data Mining is a platform for developing intelligent applications, not a stand-alone application. You can build custom applications that are intelligent, because the data mining models are easily accessible to the outside world. Further, the model is extensible so that third parties can add custom algorithms to support particular mining needs. Finally, Microsoft data mining algorithms can be run in real time, allowing for the real-time validation of data against a set of mined data.

Creating Intelligent Applications

The concept behind creating intelligent applications is to take the benefits of data mining and apply them to the entire data entry, integration, analysis, and reporting process. Most data mining tools show predictions of future results and help determine relationships between different data elements. Most of these tools are run against the data and produce results which are then interpreted separately. Many data mining tools are stand-alone applications that exist for the purpose of forecasting demand or identifying relationships and their functionality stops there.

Intelligent applications take the output of data mining and apply that as input to the entire process. One example of an application that makes use of a data mining model would be a data entry form for accepting personal information. Users of the application can enter a tremendous amount of data, such as birth date, gender, education level, income level, occupation, and so forth. Certain combinations of attributes don't make logical sense; for example, a seven-year-old person working as a doctor and holding a high-school diploma indicates someone is either filling in random data or showing their inability to handle data input forms. Most applications try to handle such issues by implementing complicated and deeply nested logic, but realistically it is nearly impossible to handle all such combinations of data that are valid or invalid.

To solve this problem, a business can use data mining to look at existing data and build rules for what looks valid. Each combination is scored with a level of confidence. The organization can then build the data entry application to use the data mining model for real-time data entry validation. The model scores the input against the universe of existing data and returns a level of confidence in the input. The application can then decide whether or not to accept the input based on a pre-determined level of confidence threshold.

This example points out the advantage of using a data mining engine that can run in real time: applications can be written that take advantage of the power of data mining. Rather than data mining being the end result, it becomes a part of the overall process and plays a role at each phase of integration, analysis, and reporting.

While validating input uses data mining at the front end of the data integration process, data mining can be used in the analysis phase as well. Data mining provides the ability to group or cluster values, such as similar customers or documents based on keywords. These clusters can then be fed back into the data warehouse so that analysis can be performed using these groupings. Once the groupings are known and fed back into the analysis loop, analysts can use them to look at data in ways that were not possible before.

One of the primary goals of intelligent applications is to make the power of the data mining models available to anyone, not just analysts. In the past, data mining has been the domain of experts with backgrounds in statistics or operations research. The data mining tools were built to support such users, but not to easily integrate with other applications. Thus, the ability to use data mining information was greatly restricted outside of the data mining product itself. However, with a tool that spans the entire process and opens up its models and results to other applications, businesses have the power to create intelligent applications that use data mining models at any stage.

Another aspect of a platform that allows for the creation of intelligent applications is a centralized server to store the data mining models and results. These models tend to be highly proprietary and secret. Storing them on the server protects them from being distributed outside of the organization. An added benefit is that with a shared location for models, companies have a single version of each model, not multiple variants residing on each analyst's desktop. Having a single version of the truth is one of the goals of data warehousing, and this concept can be extended to data mining so that there is a single version of the model that has been created and tuned for the particular business.

Benefits of SQL Server 2005 Data Mining Features

SQL Server 2005 data mining features contain a number of benefits over traditional data mining applications. As discussed previously, SQL Server 2005 data mining features are integrated across all the SQL Server products, including SQL Server, SQL Server Integration Services, and Analysis Services. SQL Server data mining tools are not a single application that a company runs to produce output that is then analyzed independent of the rest of the analysis process. Instead, data mining features are embedded throughout the process and are able to run in real time and the results can be fed back into the process of integration, analysis, or reporting. However, these features would be meaningless if they were difficult to use. Fortunately, Microsoft has focused on making the tools easy to use.

Ease of Use

With SQL Server 2005, Microsoft has sought to move data mining out of the realm of Ph.D.'s and make it available to developers and database administrators (DBAs) for setting up and running data models, and any analyst, decision maker, or other user can use the output of the model with no special knowledge required.

For example, a firm using an early version of SQL Server 2005 wanted to implement a cross-sell application. Cross-selling recommends products to people based on their buying patterns and the product they are currently purchasing. For example, a customer purchasing their third movie starring a particular actress might be interested in movies starring that actress more than other movies in the same genre. On the other hand, customers interested in both science fiction and horror movies would probably not be interested in a cross promotion for a romance movie.

In order to implement a cross-sell application, this company turned to a DBA, not an analyst. The DBA used SQL Server 2005 new data mining features to set up a predictive model that recommended sales based on a number of factors including sales history and customer demographics. The out-of-the-box model is capable of producing one million predictions per second for this particular customer. The end result: sales of recommend products have doubled since implementing the new model.

Simple yet Rich API

The data mining features in SQL Server 2005 make the creation of intelligent applications easy thanks to a very powerful but simple API. This API includes the ability to call predictive models from client applications without having to understand the internals of each model and how they work. This allows developers to call the engine and choose the model that provides the best results based on the data analyzed. Data that is returned is tokenized, meaning that numeric values are returned in a series of attributes. This allows developers to work with simple data rather than some new data format.

Accessing the data mining results is as simple as using an SQL-like language called Data Mining Extensions to SQL, or DMX. The syntax is designed to be approachable to those who already know SQL. For example, a DMX query might appear as follows.

SELECT TOP 25 t.CustomerID   
FROM CustomerChurnModel 
NATURAL PREDICTION JOIN
OPENQUERY('CustomerDataSource', 'SELECT * FROM Customers')
ORDER BY PredictProbability([Churned],True) DESC

Scalability

One of the most important features of data mining in SQL Server 2005 is the ability to handle large data sets. In many data mining tools, the analyst must create a valid random sample of the data and run the data mining application against that random sample. While generating a random sample sounds easy enough, statisticians will provide a host of reasons why generating valid and truly random samples is difficult and fraught with peril.

SQL Server 2005 eliminates the sampling challenge by allowing the models to run against the entire data set. This means that analysts do not have to build sample sets and that algorithms operate on all the data, providing the most accurate results possible.

The SQL Server 2005 Data Mining Algorithms

There are a number of algorithms available in SQL Server 2005 (Table 1).

Table 1. The algorithms featured in SQL Server 2005 Data Mining

ModelDescription
Decision TreesThe Decision Trees algorithm calculates the odds of an outcome based on values in a training set. For example, a person in the age group 20-30 that makes over $60,000/year and owns a home is more likely to need a lawn service than someone in the age group of 15-19 who doesn't own a home. Based on age, income, and home ownership, the Decision Trees algorithm can calculate the odds of that person needing a lawn service based on historical values.
Association RulesThe Association Rules algorithm helps identify relationships between various elements. For example, it is used in cross-selling solutions because it notes relationships between items and can be used to predict what else someone buying a product will also be interested in purchasing. The Association Rules algorithm can handle incredibly large catalogs, having been tested on catalogs of over half a million items.
Naive BayesThe Naive Bayes algorithm is used to clearly show the differences in a particular variable for various data elements. For example, the Household Income variable differs for each customer in the database, and can be used as a predictor of future purchasing. This model excels at showing the differences between certain groups such as customers who churn and those who don't.
Sequence ClusteringThe Sequence Clustering algorithm is used to group or cluster data based on a sequence of previous events. For example, users of a Web application can often follow a variety of paths through the site. This algorithm can group customers based on their sequence of pages through the site to help analyze users and determine if some paths are more profitable than others. This algorithm can also be used to predict, such as predicting the next page a user may visit. Note that the predictive capability of the Sequence Clustering algorithm is something that many other data mining vendors cannot deliver.
Time SeriesThe Time Series algorithm is used to analyze and forecast time-based data. Sales are the most commonly analyzed and predicted data using the Time Series algorithm. This algorithm looks for patterns across multiple data series so that businesses can determine how different elements affect the analyzed series.
Neural NetsNeural networks are the core of artificial intelligence. They seek to uncover relationships in data that other algorithms miss. While the Neural Nets algorithm tends to be slower than the other algorithms, it finds relationships that may be non-intuitive.
Text MiningThe Text Mining algorithm appears in SQL Server Integration Services and analyzes unstructured text data. This allows companies to analyze unstructured data such as a "comments" section on a customer satisfaction survey.

Extensibility

While SQL Server 2005 includes a number of algorithms out of the box, the model used by SQL Server 2005 allows any vendor to add new models into the data mining engine. Those models become peers with the models that come with SQL Server 2005. Algorithms from third parties also benefit from all the other features: they are callable using DMX and are easy to integrate into any part of the Integrate, Analyze, and Report process.

SQL Server 2005 Data Mining and End-to-end Business Intelligence

Data Integration

The integration phase covers the capturing of data from disparate sources, the transformation of data, and loading it into one or many sources. Traditional data mining tools play almost no role in the integration phase, as it is this phase that captures data and prepares it to be mined. While this may sound a bit like a chicken and egg problem, the Microsoft approach to this phase is rather straight-forward: capture the data, consolidate it, mine it, and then use the results of the mining to apply to the current and all future data. Furthermore, the data mining algorithms help companies spot outliers that already exist in the data, or outliers that may be brought in during a traditional extraction, transforming, and loading (ETL) process.

During the integration phase, it is also possible to have missing values provided by the model if interpolated values are acceptable. These values can be from a prior period or can be forecasts of future activity. The advantage that Microsoft data mining tools offer are that the numbers can be generated on the fly during the integration process rather than being provided only after the integration is completed.

Data mining tools are integrated with SQL Server Integration Services. This means that during the data movement and transformation stage, data can be analyzed and modified based on the predictive output of the data mining models. For example, documents or text fields can be analyzed on the fly and placed in appropriate buckets based on keywords within the documents.

Data Analysis

Typical data mining tools generate results after a data warehouse is built and these results are analyzed independently of the analysis done on the data warehouse. Forecasts are generated or relationships are identified, but the result of the data mining models is generally independent of the data used in the data warehouse.

Microsoft tools are integrated with the entire process. Just as data mining is available in SQL Server Integration Services, the benefits of data mining are visible in Analysis Services and SQL Server as well. Whether a company chooses to use relational or OLAP data, mining benefits can be evident during the analysis phase. Thanks to the Universal Data Model (UDM), analysis can be performed against either relational or OLAP data in a transparent manner, and data mining provides a boost to this analysis.

When analyzing certain data elements, such as how products are related or how to group customers based on buying or Web site surfing patterns, various data mining models can determine how to cluster those customers or products into groups that make sense for analysis. When you feed these groups back into the analytic process, the data mining engine allows analysts and users to slice and drill based on these clusters.

Reporting

Once the modeling is complete and an accurate model has been created, the emphasis on data mining changes from analysis to results, and more importantly putting these results to work by getting them into the hands of the right people at the right time. Thanks to the integration between data mining and reporting in SQL Server 2005, providing predictive results to anyone in the organization can be done in a simple, flexible, and scalable manner.

By leveraging SQL Server 2005 Reporting Services, the results of predictive models can easily be deployed to printed reports, Microsoft Office documents, or the intranet by embedding the report into Microsoft SharePoint Services. For example, a department could easily see intelligent forecasts of product sales, or distribute a list of the most likely customers to purchase a product to their call center. They could even see intelligent reports displaying the top ten reasons customers buy or don't buy the product and target their efforts appropriately. Microsoft allows the intelligence and power of data mining to be easily exposed through reporting, delivering meaningful data to users in an easy-to-digest format.

Business Problems that can be Solved with Data Mining

When looking at business problems that can be solved by data mining, most people think of market basket analysis or finding relationships between data that were previously unknown. In reality, there are a variety of problems that can be addressed with data mining, but to do this it is important to realize that data mining can fit into any phase of the Integrate, Analyze, and Report process.

Problem 1: What Will Customers Buy? Which Products Sell Together?

One of the best-known applications of data mining is the traditional market basket analysis, in which relationships between products are examined. Companies, especially in the retail segment, are intensely concerned with learning which products sell together. This allows businesses to promote and cross-sell products. Companies seek to identify relationships between products that may not appear intuitive, such as the classic example of beer and diaper sales being closely related, due to the fact that many men, sent to the store to pick up diapers, also bought beer while in the store.

Data mining helps companies examine all the products and determine the odds of other products being sold with that product. For example, if a business sells a jar of peanut butter, what are the other products that are likely to be sold? Specifically, what are the odds of each of those being sold? Are the odds high of a purchaser of peanut butter also buying bread or crackers? What about jelly? Those might be obvious, but market basket analysis should rank all items that are sold in conjunction with peanut butter, enabling companies to decide how to market and make product placement decisions.

Using SQL Server Data Mining, companies can analyze transactional data in either relational data warehouses or OLAP cubes to find frequent product combinations. The Microsoft Association Rules algorithm determines items that occur together and derives the rules indicating the strength of the correlations. Furthermore, beyond simple analysis, companies can apply models created with SQL Server Data Mining to produce real-time product recommendations either at the point-of-sale for coupon generators, or in an online shopping basket to increase sales of related products.

Problem 2: Identifying Churning Customers

Businesses spend a tremendous amount of time, energy, and money attracting customers. Customer retention becomes a critical issue for companies as the cost of obtaining a customer rises. In some industries, such as the telecommunications industry, customers have a habit of churning, or frequent moving from one carrier to another, signing up to receive bonuses or special offers and leaving as soon as possible for special deals at another carrier.

By identifying customers at risk of churning, companies can better evaluate whether or not to accept those customers or design strategies to reduce churning and thus increase customer retention. Data mining can help identify customers likely to churn by examining customers who have churned and those who haven't, and identifying characteristics that help predict what a new customer might do.

SQL Server Data Mining contains multiple algorithms that can perform churn analysis based on historical data. Each of these algorithms will provide a probability, or likelihood, that each of the customers will stay or go. SQL Server Data Mining provides simple tools that make it easy to determine which algorithms and settings produce the most accurate model for a particular situation, guaranteeing that organizations will get the best results possible. Once a company has chosen the best model, it can put the model to work by using the DMX language and SQL Server Reporting Services to deliver the list of most likely churners via Web reports or a SharePoint portal.

Problem 3: Where has the Marketplace Been, and Where Is it Going?

Predicting future sales is not just a matter of forecasting future revenues; many businesses use forecasted sales to determine staffing levels, order raw materials and supplies, and plan marketing campaigns. Companies that place a heavy emphasis on predictions of future sales or other market activities have long used a variety of statistical methods to perform trendline analysis. Some models let you add forecasts for certain economic factors as well. Where most of these models fall apart is that they cannot create a forecast that considers the impact of additional arbitrary data series. For example, they don't consider product sales when forecasting inventory levels.

The Time Series algorithm in SQL Server Data Mining examines the relationships across multiple series of data taking into account the natural periodicities in the business cycle. Not only can companies forecast results for individual items, they can see how item sales are correlated with other factors or how sales patterns change and evolve over time.

Problem 4: Analyzing the Web Site

Web sites are an integral part of many businesses today. Web sites act as a primary marketing tool, presenting the company face to a worldwide audience twenty four hours a day, seven days a week. Monitoring uptime, scalability, and responsiveness are all critical, but there are other item measurements that are harder to obtain but offer great insight into customers' habits. Once such issue is not just the common paths users take through the site, but how those users can be grouped and analyzed. For example, a site visitor that starts at the home page and then visits the products and contacts pages is different from visitors who start on the product page by coming in from a link on a different site and then visit the services page.

Data mining lets you not just group Web site visitors based on their paths through the site, but also analyze data based on those groups. For example, sales can be analyzed per group of users. Relationships can be established between groups of users, products ordered, and Web site navigation. Based on the entry point of users and the pages from which they came, the effectiveness of various marketing campaigns can be analyzed and used to help predict the results of future efforts.

The Microsoft Sequence Clustering algorithm provided with SQL Server Data Mining allows companies to segment users based on how they use the company's Web site rather than simply what pages they visit. The results of this segmentation can then be imported into Analysis Services cubes for trend and historical analysis. The Microsoft Time Series algorithm can then be used to forecast sales or traffic from those groups providing valuable marketing and operational information on how to organize a firm's Web presence for maximum value and efficiency.

Problem 5: Determining the Success of a Marketing Campaign

Some businesses spend a tremendous amount of money conducting marketing campaigns, but few have the financial resources to conduct in-depth consumer surveys and focus groups to determine the effectiveness of a marketing campaign. Most businesses simply guess at the effectiveness of a particular campaign by examining the sales during a campaign to sales from a previous time period, but this ignores a host of other issues, such as new products the company may have introduced.

By using data mining, companies have the ability to examine the impact of a marketing campaign while considering the current product mix, predicted sales without the campaign, changes in customer demographics, and so forth. Businesses also have the ability to predict the success of future marketing campaigns and funnel their money accordingly.

SQL Server Data Mining provides tools to both help a company target its marketing and optimize its efforts for greatest profits. Using the Microsoft Windows Clustering algorithm businesses can determine the differences and similarities in its customer base and tailor its message appropriately. By using Microsoft Decision Trees and other algorithms, it can predict the customers most likely to respond to a campaign, allowing companies to get the most reward for their marketing dollar. Through SQL Server Integration Services, companies can quickly and easily put these models into production generating mailing lists that are optimized for their best customers with messages tailored specifically for them.

Problem 6: Poor Quality Data

No organization has perfectly clean data. Building a data mart of data warehouses is perhaps the most effective way to uncover bad data, although that is not the intent of data warehousing. Data warehousing uncovers issues such as null or missing values, invalid dates, data in the wrong format, data outside of acceptable boundaries, and data that is inconsistent (such as an order that was shipped before it was placed or hours logged against a facility that is closed.)

When it comes to addressing data quality, the extraction, transforming, and loading (ETL) process takes center stage. It is during this process (especially the transformation part) that data is supposed to be cleansed. However, cleaning data at this stage does nothing to prevent the entry of bad data in the future. Instead, the data entry applications must be modified in order to restrict data input and prevent errors at the source. This has the added benefit of reducing the effort required to create the ETL process.

Often people see data mining as an end result, but data mining techniques are a great way to help verify the input of data in their data entry applications. By mining the existing data, the application can use confidence levels to determine if data input is good or not. This means that the data entry application is not burdened with complicated decision trees that attempt to validate input. For example, if an individual claims to be fourteen years old, hold a Ph.D., and work as a check-out clerk, this data would have a low level of confidence and be rejected at the input level. Having the ability to use data mining of the end result to verify the input of data at the front end is a powerful feature for keeping bad data out of the application in the first place.

SQL Server Data Mining allows companies to identify bad data before it infects data entry systems and reduces overall data quality and analytical ability. Whether companies need to validate data in real time at the point of entry using DMX and .NET programming, or filter outliers in the data pipeline while loading a data warehouse using SQL Server Integration Services, SQL Server Data Mining provides the power to cleanse the data before it impacts an organization's bottom line.

Problem 7: Text Analysis

Many applications allow for the input of free-form text, whether this is via a Web page or a standard Windows application. Storing this data is simply enough, but performing analysis on it is far more difficult. Full-text indexing engines exist, but these tools typically index the text to allow for searching; they do nothing to analyze the data to look for trends or categorize documents based on their contents.

Analyzing free-form text, or text mining, scores the words in a text field or document and pulls out key words. This allows documents or comments to be clustered and categorized. These clusters of documents can then be used to analyze data just as time and products can be used. For example, analysis could be performed based upon a category of documents that mention product safety as a primary component, or it might be possible to segment records based on what customers have listed as their hobbies in a free-form text field. It is this ability to find common themes in documents and other forms of free-form text that allows for the analysis of the data based upon clusters of those documents.

SQL Server Data Mining gives businesses the power to transform the volumes of unstructured data that they acquire into structured information that can be analyzed. After transforming the textual data in SQL Server Integration Services, organizations can load the results into Analysis Services cubes, mining models, or even SQL Server Reporting Services reports to dissect exactly what it is that concerns customers.

Conclusion

The Microsoft approach to data mining is revolutionary. Rather than creating a stand-alone tool to generate groups or predict future results, Microsoft has created a platform that spans the entire process of dealing with data, something they call Integrate, Analyze, and Report.

This means that the output of a data mining model can immediately be applied back to the data gathering, transformation, and analysis processes. Anomalous data can be detected in existing data sets, and new data entry can be validated in real time, based on the existing data. This can free developers from having to create complicated decision trees in application code in an attempt to validate complex input of multiple data values.

Furthermore, model output can be applied immediately to the analysis phase. If a data mining model groups customers into clusters based on a pattern of buying or navigating a Web site, those groups are fed back into the analysis loop so that analysis can be performed with those groups as if they had been built into the data warehouse at the start. Analysts and other knowledge workers can slice and dice by those groupings and discover if certain groups are more profitable, for example.

Microsoft has also built a secure platform in which the mining model and its output are stored in a central location. No longer are models stored on a variety of separate machines where they are harder to control. Additionally, having a centralized model ensures that the same model is used by all analysts and users.

 

About the author

Craig Utley is the Vice President of Development for KiZAN Technologies LLC, where he leads a team focused on business intelligence solutions and enterprise application design and development. He has been working with Microsoft' business intelligence products since their inception and has worked on BI and data warehousing solutions for companies throughout the United States. He is an author, conference speaker, and a Microsoft MVP.

This paper was produced in partnership with A23 Consulting.

Did you find this helpful?
(1500 characters remaining)
Thank you for your feedback
Show:
© 2014 Microsoft. All rights reserved.