Managing Data Quality with Integration Services and Data Quality Manager

By Andrew Wong, Senior Consultant, and Dean Sutcliffe, Product Manager, for Data Quality Manager.

Data Quality Manager (DQM) is a commercial data integration application that provides Data Profiling, Data Cleansing and Matching, and Management Dashboard functionalities. It lets the user conduct data investigations and profiling, and measure gaps against specified business rules.

Since this article was written before the general availability of SQL Server 2008, it does not discuss the Data Profiling task which is new in Integration Services in SQL Server 2008.

Introduction

Steps in a data quality project

Most data quality projects follow a four-step process:

1. Profile the data.

2. Prepare a data quality score card.

3. Parse & deduplicate the data.

4. Test the data.

This is typically an iterative process.

Figure 1. The typical iterative steps in a data quality project.

On the market today, there are a number of specialized data quality tools, with varying levels of capabilities and cost. Within the Microsoft product line, SQL Server Integration Services has many of the basic data quality functionalities. Integration Services can also be extended to handle complex data profiling and deduplication tasks. The robustness and high performance of SQL Server make it an ideal platform for complex, high-volume data integration tasks.

This article discusses the building of a data quality application that uses SQL Server, Integration Services, and Microsoft Excel. The authors gained this experience by implementing a number of data quality projects, and acquired their engineering knowledge through the development of a commercial data quality software product, Data Quality Manager.

Architecture of a data quality application

The key to success and high performance in a data quality system is to process the data (that is, profile, parse and match the data) in situ within the SQL Server environment. Most data quality tools make a copy of the data, and process this copy in memory to obtain a result. While this is satisfactory for small data sets, the ERP systems and business intelligence platforms of most companies produce sizable quantities of raw data that seriously challenge this approach.

In our experience, we have found that the most efficient approach is to process the data closest to its home, that is, in situ in SQL Server tables. More importantly, by using SQL Server as the processing engine, we are allocating the complex, heavy-duty logic and processing to SQL Server's data engine, which is built specifically for intensive data manipulation. In addition, SQL Server has a number of facilities, such as special indices and the SQL optimizer engine, that it can use to improve its overall performance.

Data Profiling

Data profiling analyzes the content of critical data items, and tests whether the data is "fit for purpose". For example, in the context of a customer database, the data profiling question could be: "Are all telephone numbers complete, do they contain valid area codes for the United States, and is each area code consistent with the Zip code of the address?" Data profiling measures the degree of fit of the data to specified business rules, and presents the result as a percentage score that identifies the percentage of the data that passes the business rule.

Business context

The most commonly encountered profiling issues are related to consumer and commercial customer details. For instance, a business may want to profile the consistency of the following data elements:

Consumer customer name

  • How many customers have a Full Name?
  • What percentage of customers have a Full Name that is spelled correctly?
  • Is the Title of the customer spelled correctly?
  • Is the Gender of the Customer consistent with the customer's Title?
  • What is the most likely Gender based on the Name?

Phone contact details

  • Do all phone numbers follow the same or a valid pattern?
  • How many land line fields contain a mobile number?
  • Are all phone numbers in the United States?
  • How many phone numbers have no area code?
  • How many phone numbers have an invalid area code?
  • Can the phone number be validated from external reference sources?

Commercial customer name

  • Does the Name contain an organization name or a personal name?
  • If the name is the name of a legal entity, what is its company type (Ltd, Pty, Inc, and so forth)?
  • Does the company's registration number follow a consistent pattern for that State?
  • Determine the Standard Industrial Classification (SIC) based on the Company name.

Customer address

  • Is the structure of the address valid?
  • Is the address complete?
  • What is the Zip code, City, County, and State of each address?
  • Is the Zip code consistent with the City, County, and State?
  • Is the Zip code consistent with telephone area code?
  • What is the building type for each address?
  • Can the address be validated from external Postal Service sources?

E-mail address and Web site URL

  • Is the structure of the e-mail address correct?
  • List the company and country code for each e-mail address.
  • List the organization type (.com or .gov) of each e-mail address.
  • Is the e-mail address a company e-mail address that is consistent with the company name?
  • Is the e-mail address consistent with the name of the individual?
  • Is the Web site URL structurally correct?
  • Can the Web site URL be validated as active and accessible from the Internet?

The concerns of most manufacturing organizations center on product and supply chain details. The profiling issues for a manufacturing organization may include:

Product master record

  • How many products have similar descriptions?
  • List all variations of how a product is described.
  • Find the same or similar products that are classified in different categories.
  • List the products that can be ordered from different branch offices of the same supplier.

E-Commerce supply chain

  • Which products are incorrectly classified under the United Nations Standard Products and Services Code (UNSPSC)?
  • In how many categories does a product appear?
  • Which supplier's products do not use standard nomenclature?
  • Which products do not follow the industry standard for unit of measure, package size, and so forth?

Building profiling capabilities in SQL Server

The general approach in building profiling capabilities in SQL Server is:

1. Abstract business rules into logic, which is translated into Transact-SQL as a profile action.

2. Abstract data into patterns and token lists.

Externalizing parts of the business rules into patterns and reference tokens makes the system extendable. These patterns and tokens can be adapted to other projects, or updated to handle data changes in the future.

Profile actions are built as Transact-SQL user-defined functions and stored procedures that use associated token lists. These items are embedded as database objects in SQL Server, and can be called by any utility that can execute a query on the server.
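
As a simple sketch of this structure (the dbo.ref_AreaCode table and the dbo.usp_Profile_PhoneAreaCode procedure are illustrative names of ours, not objects that ship with any product; the customer table and phone column match the example used later in this article), a token list of valid United States area codes and a profile action that scores a phone column against the rule "the area code must appear on the reference list" might look like this:

-- Hypothetical token list: valid US area codes, maintained as reference data.
CREATE TABLE dbo.ref_AreaCode
(
    AreaCode char(3)     NOT NULL PRIMARY KEY,
    Region   varchar(50) NOT NULL   -- textual label, for example 'Washington, D.C.'
);
GO

-- Hypothetical profile action: percentage of customer phone numbers
-- whose leading three digits appear on the area code token list.
CREATE PROCEDURE dbo.usp_Profile_PhoneAreaCode
AS
BEGIN
    SET NOCOUNT ON;
    SELECT CAST(100.0 * SUM(CASE WHEN ac.AreaCode IS NOT NULL THEN 1 ELSE 0 END)
                      / COUNT(*) AS decimal(5,2)) AS PassPercentage
    FROM dbo.tbl_Customer_Details AS c
    LEFT JOIN dbo.ref_AreaCode AS ac
        ON LEFT(c.Customer_Phone, 3) = ac.AreaCode;
END
GO

Because the area codes live in a table rather than in the code, the same profile action can be reused for another country, or updated as the reference data changes.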

In building the Data Quality Manager product, we built a data profiling user interface that provides an interactive platform for users and analysts. The profiling results are stored in result tables, which can be displayed by using Excel Services or a similar reporting service. Salient information is summarized into a score card and a dashboard. This facilitates communication with project sponsors and senior management.

Figure 2. The user interface for data profiling in Data Quality Manager.

Profiling functions

In the example below, a client wishes to use the phone number field for their sales force, regardless of the state or country in which each salesperson resides. The common profiling questions are:

  • Is a phone number of the right length?
  • Does it have the correct area code?
  • How many different patterns do we have?
  • Do we have a textual label (for example, area code 202 is Washington, D.C.)?

Below is a simple profiling function written in Transact-SQL; the same approach can be reproduced on most SQL database servers. The purpose of the function is to query a phone number field, then store the result. From the result, the analyst can understand the current state of the data, and assess how much effort is required to get the data into a state fit for use.

Looking at this phone number field on its own will only give us a high-level view of the data. However, if we include reference data (such as the area code list for the United States) or alternative fields from within the data set, we can further validate the data or repair values when they are missing or deemed incorrect.
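
For example, a hedged sketch of such a cross-check might join the customer table to a cross-reference of area codes and Zip prefixes (dbo.ref_AreaCodeZip and the Customer_Zip column are assumed names used only for illustration):

-- Count records whose phone area code is consistent with the Zip code on the same record.
SELECT CASE WHEN x.AreaCode IS NULL
            THEN 'Area code missing, unknown, or inconsistent with Zip'
            ELSE 'Area code consistent with Zip' END AS Consistency,
       COUNT(*) AS RecordCount
FROM dbo.tbl_Customer_Details AS c
LEFT JOIN dbo.ref_AreaCodeZip AS x
    ON  LEFT(c.Customer_Phone, 3) = x.AreaCode
    AND LEFT(c.Customer_Zip, 3)   = x.ZipPrefix
GROUP BY CASE WHEN x.AreaCode IS NULL
              THEN 'Area code missing, unknown, or inconsistent with Zip'
              ELSE 'Area code consistent with Zip' END;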

The function below walks the length of the input string, testing one character at a time to determine whether it is a letter, a number, a space, or something else, and returns a pattern string that reflects what it found.

CREATE FUNCTION [dbo].[ufn_PatternDisplay] (@txtstring nvarchar(255))
RETURNS varchar(255)
AS
BEGIN
-- =============================================
-- Author:      Dean Sutcliffe
-- Create date: 4 July 2006
-- Description: Replace each character in a string with A, N, space, or #
--              based on its content.
-- =============================================
    DECLARE @i int, @Output varchar(255), @temp nchar(1)
    SET @i = 1
    SET @Output = ''
    WHILE @i < LEN(@txtstring) + 1
    BEGIN
        SET @temp = SUBSTRING(@txtstring, @i, 1)
        SELECT @Output = @Output + CASE
            WHEN @temp LIKE '[a-zA-Z]' THEN 'A'  -- a letter becomes 'A'
            WHEN @temp LIKE '[0-9]'    THEN 'N'  -- a digit becomes 'N'
            WHEN @temp = ' '           THEN ' '  -- a space stays a space
            ELSE '#'                             -- everything else becomes '#'
        END
        SET @i = @i + 1
    END
    RETURN @Output
END
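
Assuming the function has been created in the target database, a quick check against a literal value shows the kind of pattern it produces:

SELECT dbo.ufn_PatternDisplay(N'(02) 4272 9777') AS Customer_Phone_pattern;
-- Returns: #NN# NNNN NNNN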

Presenting profiling results

The results of profile actions are contained in result tables. In Data Quality Manager, we have chosen Microsoft Excel as the presentation platform. Excel offers ease of use and seamless integration with the rest of the Microsoft Office suite. This streamlines the process of producing data quality reports.

As an example, here are the data profiling results of the pattern function example shown above:

RecordCount | Customer_Phone_pattern | data_sample     | %
123,374     | #NN# NNNN NNNN         | (02) 4272 9777  | 36.20%
56,754      | NN NNNNNNNN            | 02 42716136     | 16.70%
45,678      | NNNNNNNNNNNNN          | 12012282087     | 13.40%
34,677      | NNNNNNNNNNN            | 293281900       | 10.20%
34,667      | NN NNNN NNNN           | 02 2816 1447    | 10.20%
10,004      | Null                   | Null            | 2.90%
9,701       | NNNNNNNNN              | 26348337        | 2.80%
6,976       | NNNNNNNNNNNN           | 2026236969      | 2.00%
6,754       | #NNNNNNNNNNNN          | 3.1018E+11      | 2.00%
4,567       | NN#NNNN#NNNN           | 02.9417.0799    | 1.30%
3,456       | NNN NNN NNNN           | 027 572 9265    | 1.00%
2,345       | NNN#NNN#NNNN           | 072-638-5294    | 0.70%
1,004       | NNNNNNNNNN             | 49360830        | 0.30%
345         | #NNNNNNNNNNN           | 5603934334      | 0.10%
90          | NNNNNNNN               | 17400407        | 0.00%
90          | AAAAAAA                | UNKNOWN         | 0.00%
79          | AAAAAA                 | FOLLOW UP       | 0.00%
47          | #NN#N#NNNNNNNNN        | +31(0)102175866 | 0.00%

From the table above, we can see that data has been keyed to differing standards. We have a small number of clearly invalid records like "UNKNOWN", a small number of NULL records, and so forth. We can now classify each pattern against business rules (as Valid or Invalid), and derive an overall data quality score for this phone number data.
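
As a sketch of that classification step (dbo.ref_PhonePattern is an illustrative table name; an analyst would populate it after reviewing the profile), we can mark each observed pattern as valid or invalid and compute the score in one query:

-- Illustrative classification table, populated by hand after reviewing the patterns.
CREATE TABLE dbo.ref_PhonePattern
(
    PhonePattern varchar(255) NOT NULL PRIMARY KEY,
    IsValid      bit          NOT NULL
);
GO
INSERT INTO dbo.ref_PhonePattern (PhonePattern, IsValid) VALUES ('#NN# NNNN NNNN', 1);
INSERT INTO dbo.ref_PhonePattern (PhonePattern, IsValid) VALUES ('NN NNNN NNNN', 1);
INSERT INTO dbo.ref_PhonePattern (PhonePattern, IsValid) VALUES ('AAAAAAA', 0);
-- ...and so on for the remaining observed patterns.
GO

-- Overall score: percentage of phone numbers whose pattern is classified as valid.
SELECT CAST(100.0 * SUM(CASE WHEN p.IsValid = 1 THEN 1 ELSE 0 END)
                  / COUNT(*) AS decimal(5,2)) AS PhoneQualityScore
FROM dbo.tbl_Customer_Details AS c
LEFT JOIN dbo.ref_PhonePattern AS p
    ON dbo.ufn_PatternDisplay(c.Customer_Phone) = p.PhonePattern;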

We can also plan the next steps in how we are going to handle the phone number data. If the database also contains address data, then we can determine whether a number like 002026236969 is in Maryland in the United States, or in Sydney, Australia. After this decision is made, we can reformat the number appropriately.

Profiling with SQL Server Integration Services

The profiling function demonstrated above can be wrapped in an Execute SQL task and published in an Integration Services package. By doing so, the core functionality of Integration Services is extended with data quality tools that are tailored specifically for your data environment.

Figure 3. Data profiling in an Integration Services package
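
As a rough sketch of what such an Execute SQL task might run (dbo.ProfileResult is an assumed name for a results table that downstream reporting reads), the task simply persists the pattern summary on each run:

-- Persist one pattern profile run for the phone number column.
INSERT INTO dbo.ProfileResult (RunDate, ColumnName, PatternValue, RecordCount, DataSample)
SELECT GETDATE(),
       'Customer_Phone',
       dbo.ufn_PatternDisplay(Customer_Phone),
       COUNT(*),
       MIN(Customer_Phone)
FROM dbo.tbl_Customer_Details
GROUP BY dbo.ufn_PatternDisplay(Customer_Phone);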

Profiling with SQL Server

These profiling functions can also be called inline, built into an API or the SOAP interface of a Web service, or used as part of a Transact-SQL GROUP BY query.

Here is an example of calling the profiling function described above as a SQL statement:

Select Count(*) as RecordCount,
[dbo].[ufn_PatternDisplay]([Customer_Phone]) as Customer_Phone_pattern,
Min([Customer_Phone]) as data_sample
From [tbl_Customer_Details] CtmrDtls
Group by [dbo].[ufn_PatternDisplay]([Customer_Phone])
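
To reproduce the percentage column shown in the earlier results table, the same query can be extended with a windowed total. This is a sketch that assumes SQL Server 2005 or later and the table and column names used above:

SELECT COUNT(*) AS RecordCount,
       dbo.ufn_PatternDisplay(Customer_Phone) AS Customer_Phone_pattern,
       MIN(Customer_Phone) AS data_sample,
       CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS decimal(5,2)) AS [%]
FROM tbl_Customer_Details
GROUP BY dbo.ufn_PatternDisplay(Customer_Phone)
ORDER BY RecordCount DESC;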

Building a Data Quality Score Card

Using a score card to manage and communicate

Data profiling results should then be summarized in a score card that shows how each of the business rules is measured. This provides a business summary of data quality issues in your organization. The score card is an essential business communication tool. It lets the business understand the scope and implication of data quality issues, and to decide on appropriate actions.

When presenting the score card in Data Quality Manager, we use a hierarchy of three concepts:

  • Business rules.
  • The critical data items associated with each rule.
  • The data profiling result.

As an illustration, consider the following example.

ACME Manufacturing has conducted a data quality profiling project, and has presented the outcomes in the following structure:

Business Rules                   | Critical Data Items         | Data Quality profile
Supplier details must be current | Company name                | Completed with a legal entity name
                                 |                             | Non-blank
                                 | Company registration number | Must be completed
                                 |                             | Registration number has a valid structure
                                 |                             | Number found in field
                                 | Parent company              | Non-blank when parent id has been set
                                 | Domicile country            | Valid (on reference list)
                                 | Web site URL                | Completed, with valid structure
                                 |                             | Valid if URL is provided

This structure identifies a number of key business rules, including "Supplier details must be current" (as shown in the sample above), "Vendor terms must be valid", "Customer delivery details must be up-to-date", and so forth.

Each of these business rules is supported by a number of critical data items. For example, "Company Name", "Registration number", "Parent Company", "Domicile Country", and "Web site URL" are the critical data items that support the "Supplier details must be current" business rule.

Under the "Company Name" data items, the data quality profile measures the percentage that are "Non-blank", and "Completed with a legal entity name". For the critical data item of "Web site URL", the profile measures the percentage that are "Completed, with valid structure", and "Valid if URL is provided".

A simple score card can now be assembled by using the profile percentage scores. The profile can be weighted, and target values can be set. Then a more sophisticated score card that reflects business priorities starts to emerge.
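
A hedged sketch of that roll-up, assuming a summary table of profile scores (dbo.ProfileScore, with one row per business rule, critical data item, and profile measure, plus a weight and a target, is an illustrative name of ours):

-- Weighted data quality score and weighted target per business rule.
SELECT BusinessRule,
       CAST(SUM(Weight * Score)       / SUM(Weight) AS decimal(5,2)) AS WeightedScore,
       CAST(SUM(Weight * TargetScore) / SUM(Weight) AS decimal(5,2)) AS WeightedTarget
FROM dbo.ProfileScore
GROUP BY BusinessRule
ORDER BY BusinessRule;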

Building the score card

In Data Quality Manager, profiling results are aggregated into a summary result table, and presented in Microsoft Excel.

A summary chart in the form of a polar chart gives a business-wide overview of the data quality summary. Traffic light icons provide an instant view of the data quality score against target.  A detailed grid shows the data quality score and target for each of the critical data items.

Figure 4. The summary view in a data quality score card in Data Quality Manager.

Figure 5. The detail view in a data quality score card in Data Quality Manager.

Using the score card

The score card is a management tool that explains data quality issues in terms of what is important for the business.

For example, the business may discuss:

  • Which data items are at the core of business operations?
  • What is the type of, or the source of, data quality issues?
  • What are the implications for internal processes in data quality improvement?
  • Which organizational unit is responsible for data quality improvement?
  • What is the hidden cost of poor data quality to the business?
  • What level of improvement in data quality is practical?
  • How much data quality improvement has there been since the beginning of a data quality program?

We have found that over two-thirds of the effort in a typical data quality project is spent in data profiling and in communicating with the score card. This highlights the fact that a data quality project is essentially an exercise in business education and communication. Presenting the profile findings in a score card often transforms the data quality discussion from a technical exercise to a business imperative.

Parsing, Deduplication, and Data Enhancement

Patterns and parsing

Parsing is the decomposition of a piece of compound data into its business components. It is a fundamental step in matching, deduping, and enhancing data. By breaking a string into its components, pattern and token lists can be produced, and a solid foundation can be built for sophisticated or fuzzy matching.

Parsing is always specific to the underlying business context. For instance, a 10-digit phone number and a 10-digit product number may have a similar pattern of aaabbbcccc. The phone number parser will recognize that "aaa" is the area code. The parser may use a zip code/area code cross-reference list to confirm that the results are correct.
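
As an illustration of this point, and not a production parser, a phone number that has already been reduced to ten plain digits can be split into its business components and labeled against the hypothetical area code reference table sketched earlier:

-- Parse 10-digit US phone numbers into area code and local number,
-- and label the area code from the reference list (unknown codes are flagged).
SELECT c.Customer_Phone,
       LEFT(c.Customer_Phone, 3)         AS AreaCode,
       SUBSTRING(c.Customer_Phone, 4, 7) AS LocalNumber,
       COALESCE(ac.Region, 'Unknown area code') AS AreaCodeLabel
FROM dbo.tbl_Customer_Details AS c
LEFT JOIN dbo.ref_AreaCode AS ac
    ON LEFT(c.Customer_Phone, 3) = ac.AreaCode
WHERE dbo.ufn_PatternDisplay(c.Customer_Phone) = 'NNNNNNNNNN';  -- only rows that are ten plain digits

A product number with the same ten-digit pattern would be parsed by entirely different logic, which is why each parser is tied to its business context.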

Some of the commonly used parsing functions include:

  • Street addresses (Western style: US, UK, AU)
  • Street addresses (European style: GN, FA, DA)
  • Street addresses (double-byte: HK, China)
  • Postal data
  • Company name
  • Person name
  • Product description
  • Email address
  • Phone number
  • Account and cost center name

Matching without parsing

SQL Server Integration Services provides a fuzzy matching capability, but its effectiveness depends on matching data items that come from the same business context. In other words, Integration Services fuzzy matching works better when the data is properly parsed.

Let us consider this example to illustrate this point. The business needs to dedupe a Customers list that contains the name, two columns of address data, and the phone number. Below is an example of 4 customer records that include 2 individuals at the same address.

Name          | Address1                                       | Address2              | Phone no
John Dawkins  | 1530 Olmo Way, Walnut Creek, CA 94595          |                       | 961 555 9874
J Dawson      | 15/30 Walnut Creek Road, Walnut Creek, CA94597 |                       | 555 9847
John Dawking  | Unit 15, 30 Olmo Way                           | Monks Creek, CA 94598 | 961 555 9847
Jenny Dawkins | 1530 Olmo Way, Walnut Creek, CA 94595          |                       | 961 555 9847

Figure 6. Fuzzy matching without prior parsing in an Integration Services package.

When we apply Fuzzy Grouping, SQL Server Integration Services matches all four records into a single set, because Jenny Dawkins and John Dawkins qualify as a fuzzy match. In the context of our effort to identify distinct individuals, this is a false result (also known as a false positive).

Match Id | Name          | Address1                                       | Address2              | Phone no
Set 1    | John Dawkins  | 1530 Olmo Way, Walnut Creek, CA 94595          |                       | 961 555 9874
Set 1    | J Dawson      | 15/30 Walnut Creek Road, Walnut Creek, CA94597 |                       | 555 9847
Set 1    | John Dawking  | Unit 15, 30 Olmo Way                           | Monks Creek, CA 94598 | 961 555 9847
Set 1    | Jenny Dawkins | 1530 Olmo Way, Walnut Creek, CA 94595          |                       | 961 555 9847

Parsing before matching

In this next example, we have added two preliminary parsing steps to parse the name and address into their components.

Figure 7. Fuzzy matching with prior parsing in an Integration Services package.

Parsing the input string into its components allows the fuzzy matching algorithms to operate on specific components of the data, and to match data coming from the same domain. More importantly, it allows the matching process to be 'discriminating' when it comes to the relative importance of components. For instance, we may find that Street name and Zip code are more important than Street number or Unit type. We may also find that Phone number is of less importance, if the data is organized geographically.

The following table shows the same 4 records after preliminary parsing of the name and address data:

Name  | Initial | Surname | Unit | Unit Type | Street No | Road Name    | Road Type | City         | State | Zip Code | Area Code | Phone No
John  | J       | Dawkins |      |           | 1530      | Olmo         | Way       | Walnut Creek | CA    | 94595    | 961       | 5559847
      | J       | Dawson  | 15   |           | 30        | Walnut Creek | Road      | Walnut Creek | CA    | 94597    |           | 5559847
John  | J       | Dawking | 15   | U         | 30        | Olmo         | Way       | Monks Creek  | CA    | 94598    | 961       | 5559847
Jenny | J       | Dawkins | 15   | U         | 30        | Olmo         | Way       | Monks Creek  | CA    | 94598    | 961       | 5559847

By adding a parsing function in the job stream, we have improved the results of matching:

Match Id | Name          | Address1                                       | Address2              | Phone no
Set 1    | John Dawkins  | 1530 Olmo Way, Walnut Creek, CA 94595          |                       | 961 555 9874
Set 1    | J Dawson      | 15/30 Walnut Creek Road, Walnut Creek, CA94597 |                       | 555 9847
Set 1    | John Dawking  | Unit 15, 30 Olmo Way                           | Monks Creek, CA 94598 | 961 555 9847
Set 2    | Jenny Dawkins | 1530 Olmo Way, Walnut Creek, CA 94595          |                       |
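
The gain comes from letting the matching logic treat the parsed components selectively. As a rough Transact-SQL illustration of that idea (it is not the Fuzzy Grouping transformation, and dbo.ParsedCustomer with its columns is an assumed name for the parsed output), candidate pairs can be required to agree on the discriminating components before the names are compared:

-- Candidate duplicate pairs: surnames that sound alike, the same area code and phone,
-- and given names that do not conflict (a record carrying only an initial is treated as compatible).
SELECT a.CustomerId AS CustomerA,
       b.CustomerId AS CustomerB
FROM dbo.ParsedCustomer AS a
JOIN dbo.ParsedCustomer AS b
    ON  a.AreaCode = b.AreaCode
    AND a.PhoneNo  = b.PhoneNo
    AND a.CustomerId < b.CustomerId             -- report each pair once
WHERE DIFFERENCE(a.Surname, b.Surname) >= 3     -- SOUNDEX similarity: 0 (none) to 4 (identical)
  AND (a.Name = b.Name OR a.Name IS NULL OR b.Name IS NULL);  -- 'Jenny' versus 'John' blocks the match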

The parsing function is an essential component of a data quality system. We have found that parsing data before applying fuzzy matching greatly increases the effectiveness of a deduplication process. It makes the matching process more discriminating and more targeted, and provides you with the degree of control that you need over the matching process.

Conclusion

SQL Server Integration Services has much of the basic functionality required to ensure data quality. This article has illustrated how SQL Server can be extended, especially in the area of profiling and parsing. The architecture, performance, and robustness of SQL Server make it an ideal platform to handle complex and demanding Data Quality requirements.

About the authors and their product

Andrew Wong, Senior Consultant, DQM. Andrew is a lead consultant in data quality projects for DQM. He has substantial experience in large-scale data integration, systems migration, and business intelligence projects. Andrew has developed a set of data integration methodologies that have data quality and data governance principles as their centerpiece. He consults in the industry on the planning and management of data quality and data migration projects, writes regularly in data quality journals, and is the key architect behind the DQM product.

Dean Sutcliffe, Product Manager, DQM. Dean is Lead Architect and product manager of the DQM product. Dean has over 10 years of practical experience in data cleansing and data deduplication, accumulated through operating a data cleansing bureau in a direct marketing agency. During this period, he has observed many varieties of data quality issues, in almost every industry. He has developed a practical approach to data quality management, as he has demonstrated with great skill as the chief designer of DQM. Dean is a Microsoft Certified Professional with substantial knowledge of SQL Server 2005, and has been working with SQL Server since version 4.x.

The DQM (Data Quality Manager) product. The learnings in this article are based on the practical experience that the authors gained from the development and implementation of DQM for a number of clients. DQM is a data quality application that provides data profiling, data cleansing and matching, and a data quality score card and dashboard. Users can conduct data investigation and data profiling, and measure gaps against business rules. DQM can parse unstructured data into its components, cleanse and standardize the content, and match items to identify duplications. The application comes preinstalled with templates that correspond to the most commonly encountered data quality issues in the industry. This feature reduces learning time and speeds up a data quality project. DQM is built primarily on SQL Server technologies: its profiling, parsing, and matching logic is implemented as stored procedures and user-defined functions. It can be deployed as a batch process or as a Web-based real-time system. It has an easy-to-use user interface, and uses Integration Services to provide an alternative launch platform. DQM has demonstrated high performance and scalability to very large data volumes.