Distributed Data Warehousing with Microsoft SQL Server 2000 and Windows 2000 Datacenter Server

Updated: July 19, 2001

Strategic Customer Relationship Management at Microsoft

White Paper

On This Page

Executive Summary
Business Requirements
Overview
Design
Implementation
Support
Security
Building the Team
Lessons Learned
Conclusion
For More Information

Executive Summary

The owners of small businesses often know the names of all their customers, especially the names of their top customers. For example, small business owners can easily view lists of individual customer names, billing addresses, and purchases those customers have made. The small business owner often stays in close touch with customers, monitoring their satisfaction with the business, and works to nurture relationships with top customers to ensure they remain top customers.

In contrast, it is far more difficult for the owners of larger businesses to maintain such an agile approach to their customers. Large organizations have more complications to deal with: not only more customers but also more employees, products, services, and, most important, more paperwork and information of all kinds. The owners of larger businesses must therefore sift through that much more data to determine who their best customers are, which products and services their customers prefer to purchase together, and what customers need overall, based on individual purchasing habits.

With more than 100 million customers1 in the United States alone, Microsoft relies on a highly sophisticated set of tools and technology—namely, an extremely efficient Microsoft® SQL Server™ 2000-based distributed data warehouse—to manage relationships with its customers effectively. The system can load more than 2 million customer records each day using just eight computers running the Microsoft Windows® 2000 Datacenter Server network operating system and Microsoft SQL Server 2000, two computers running the Microsoft Windows 2000 Advanced Server network operating system, 550 hard drives, and 32 disk controllers, while managing nearly 1 terabyte of customer-related data.

Also within the data warehouse are 150 million SQL transaction records that support cross-tab reporting and charting capabilities designed to simplify the viewing of comprehensive information about individual customers and their interest in Microsoft products and Microsoft-sponsored events and newsletters.

For Microsoft or any other large organization, designing and deploying a distributed data warehouse environment to support a worldwide Customer Relationship Management (CRM) and Decision Support System (DSS) of such magnitude is not trivial. Without a real-world example, it can be difficult to plan, deploy, and scale an efficient system. This document defines many of the challenges Microsoft faced and the approach it used to scale and deploy its distributed data warehouse to support CRM and DSS functions—functions that previously required the aid of a third party's mainframe environment.

Through a comprehensive explanation of many of the design, development, and deployment decisions made by the Microsoft design and support teams, this document can serve as an example of the approach used by Microsoft to build a highly efficient and distributed data warehouse.

Business Requirements

The Microsoft Information Technology Group (ITG) identified three business requirements that drove data warehouse design and deployment to support CRM and DSS functions. These business requirements were enhanced customer satisfaction, economy of scale, and a more standardized and integrated environment.

Customer Satisfaction

Central to the goal of enhanced customer satisfaction is a worldwide CRM and DSS environment that Microsoft began implementing in 1999. With these environments, which are central parts of the distributed data warehouse, Microsoft anticipates being better able than ever to determine which products and services individuals and organizations are using.

One of the key functions of the distributed data warehouse is to provide Microsoft with a comprehensive view of each of its customers. Toward that end, Microsoft is working to consolidate customer information from a vast assortment of internal and external sources. Within the warehouse, every "touch" between a customer and Microsoft will be recorded in a central repository. This means that when a customer purchases a product and registers it, whether by surface mail or e-mail, or subscribes to a newsletter or attends an event, a record of each interaction becomes available, thus providing a comprehensive, individualized view of that customer. Over time, it is believed that a more efficient CRM solution will help decision-makers to better understand customer relationships and to develop even better products and services, resulting in improved long-term customer satisfaction and loyalty.

Economy of Scale

Microsoft is a highly decentralized business with sales offices in more than 60 subsidiaries. Over the years, each of these subsidiaries invested in separate customer database solutions to meet its specialized needs, with the investments taken from each subsidiary's budget. For example, a subsidiary located in the United Kingdom spent roughly U.S. $1.5 million annually on such functions. Other large subsidiaries were required to make the same kind of investment, while a few smaller subsidiaries decided to forgo the investment entirely in favor of customer-related activities and events. This means that a smaller subsidiary was often unable to nurture customer relationships in the same way that a larger subsidiary with more capital resources could.

Subsidiaries might also have multiple databases for each customer segment, or different databases for product-support calls, product registrations, outbound calling, marketing, and so on. Each of the disparate solutions required investments in data acquisition, data processing, support, and the tools needed to query customer data. Because a single investment typically benefited only a single subsidiary, the legacy approach had little economy of scale.

Therefore, another aspect of the move at Microsoft to design and deploy a distributed data warehouse capable of supporting worldwide CRM and DSS is an effort to take advantage of economy of scale, where a single centrally managed solution is of benefit to decision-makers at all subsidiaries.

Standardization

Standards are also central to the move to design and deploy a worldwide consolidated CRM and DSS environment using a single centrally managed distributed data warehouse. In the past, there were few, if any, standards among the subsidiaries in how they approached CRM. For example, some developed custom-coded applications to support sales, marketing, and customer-support groups. Others used third-party applications or outsourced the entire customer-relationship function.

To address the problem, Microsoft launched an internal initiative to develop a standardized and scalable approach to CRM and DSS that would benefit decision-makers at all subsidiaries. The process began by consolidating information among disparate systems located in the United States and by designing and deploying the distributed data warehouse that is the subject of this paper.

Overview

Because of the sheer volume of the Microsoft customer base, one of the first decisions warehouse designers made when planning the solution was that it be distributed across multiple servers. The motivation for this was powerful. A distributed design allows for multiple servers to work concurrently on individual tasks (a task being one or more SQL stored procedures). It also makes scaling easier because it supports the identification of tasks that can be divided into sub-tasks that in turn can run concurrently on multiple servers. At the programming level, this involves the division of a SQL stored procedure into two or more new stored procedures and then loading the stored procedures on separate computers running SQL Server 2000 Enterprise Edition.
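As a simplified sketch of this kind of division, consider a single aggregation procedure split into two range-based procedures that separate servers can run concurrently. All object names and the key boundary here are invented for illustration; the warehouse's actual stored procedures are not published in this paper.

    -- Original monolithic task (illustrative names throughout).
    CREATE PROCEDURE dbo.BuildRegistrationCounts AS
        INSERT INTO dbo.RegistrationCount (CustomerKey, Registrations)
        SELECT CustomerKey, COUNT(*)
        FROM dbo.ProductRegistration
        GROUP BY CustomerKey
    GO

    -- The same task divided into two sub-tasks, each covering half of
    -- the customer-key range, loaded on separate SQL Server 2000
    -- computers so they can run concurrently.
    CREATE PROCEDURE dbo.BuildRegistrationCounts_A AS
        INSERT INTO dbo.RegistrationCount (CustomerKey, Registrations)
        SELECT CustomerKey, COUNT(*)
        FROM dbo.ProductRegistration
        WHERE CustomerKey < 50000000
        GROUP BY CustomerKey
    GO

    CREATE PROCEDURE dbo.BuildRegistrationCounts_B AS
        INSERT INTO dbo.RegistrationCount (CustomerKey, Registrations)
        SELECT CustomerKey, COUNT(*)
        FROM dbo.ProductRegistration
        WHERE CustomerKey >= 50000000
        GROUP BY CustomerKey
    GO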

Even though the warehouse has been implemented as a distributed system, each server can share a portion of its data with other servers using Microsoft Cluster Service, included in Windows 2000 Datacenter Server, and a Storage Area Network. Each of the clustered servers performs specialized data processing and then shares the results of that processing by placing a database on a clustered disk resource. Processing then continues when another server requests the shared disk resource and reloads the shared database.

Ultimately, decision-makers gain visibility into the data through Web-based campaign management and DSS tools. Using these tools, decision-makers can view customer lists based on specific profiles and generate campaigns and events based on virtually any profile they choose. As of March 2001, the warehouse contained information pertaining to roughly 100 million Microsoft customers residing in the United States, with plans to include information pertaining to customers residing in other countries in the future.

No matter where in the company U.S.-based customer information originates, it is securely stored on servers residing at Microsoft headquarters in Redmond, Washington. Strict security is enforced on these systems 24 hours a day, 7 days a week.

The data warehouse is based on the following Microsoft products:

  • Windows 2000 Datacenter Server, including Microsoft Cluster Service2

  • Windows 2000 Advanced Server, including Internet Information Service

  • Microsoft SQL Server 2000 Enterprise Edition

Data in the warehouse comes from a wide variety of internal and external sources, including the microsoft.com Web site and Dun & Bradstreet. With a capacity to add 2 million records each day, the warehouse is almost always loading information on new or repeat customers who are registering new products, subscribing to newsletters, or attending company events.

Because the data warehouse is managed centrally, it requires relatively few maintenance resources. Just 23 people were needed to design, deploy, and now maintain the data warehouse, which after roughly one year of operation has already demonstrated several clear benefits. For example, the data warehouse is helping to provide a more comprehensive view of U.S.-based customers and is reducing the time previously required in the outsourced arrangement to implement new features and reporting capabilities and formulate plans for nurturing relationships with the company's top customers. Figure 1 provides an overview of the distributed design used within Microsoft and of the servers and their functions.


Figure 1: Various combinations of Microsoft technologies support the design of the CRM and DSS data warehouse

The internal data-warehouse audience consists of two primary groups of decision-makers: those seeking market trends and those doing data mining. The first group observes various market segments and uses fairly basic reports to do cross-tab reporting and historical analysis, such as the number of individuals who purchased a product after attending an event or subscribing to a newsletter. The second group mines for cross-sell and up-sell opportunities for the purpose of planning future activities and events.

Design

Because both the logical and the physical architectures of the distributed data warehouse are key to its distributed nature and scalability, the following material details the efforts made in sectioning the warehouse into logical and physical layers and in designing the flow of data through the warehouse. Some efforts are attributed to the need to obtain customer names and addresses from discrete sources, as opposed to leveraging existing customer identification numbers. (For more detail on challenges faced in implementing the logical architecture, see the "Implementation" section.)

Logical Architecture

In designing the logical architecture of the distributed data warehouse, engineers considered the primary events that must occur in the system to satisfy business requirements and the order in which these events must occur.

Events

Engineers defined seven data-warehouse events as primary: data mapping, standardization and matching, operational data storage, factory processing, decision support, campaign management, and data suppression.

Data mapping. Customer information comes from many data sources, including product-registration cards, event-registration cards, electronic surveys, and subscriptions to electronic newsletters. Each data source obtains information that is common to all the data sources (such as "customer name") as well as information that is unique to a given data source (such as "subscription start date"). Data mapping standardizes the position of data fields in tables so that a customer's name, for example, is not inadvertently stored in a field for a customer's address. Data mapping also arranges the data fields provided by various data sources so that each input file uses the same data-field format.

Standardization and matching. As with any data warehouse, the Microsoft CRM and DSS data warehouse includes provisions to guard against unnecessary duplication and redundancy. For example, when a customer changes address, phone number, or occupation, the data warehouse must reflect those changes without inadvertently updating the wrong record (such as that of another customer with the same name) or creating duplicate records (such as a customer listed once with the original address and then again with the new address). By using standardization and matching procedures for every such update as well as for all new customer records, the data warehouse ensures it is building up a robust and accurate view of individual customers.

Operational data storage. To homogenize the data in preparation for long-term storage, the data warehouse uses an Operational Data Store (ODS) based on a star-schema format, which provides a flexible and perpetual storage environment. It is from here that the DSS and campaign-management "factory processes" obtain their data.

Factory processing. In the distributed data warehouse, factory processing involves the transformation of data in the ODS so it is suitable for reporting, analysis, and campaign-management activities. Factory processing works by taking data from the ODS and enriching it by applying business rules so as to denormalize, aggregate, and partition it, thereby ensuring a suitable format for use in the data marts. According to the design of the warehouse, "individual customer" is the lowest level of aggregation in both the ODS and data marts.

Decision support. Decision-makers must be able to count the number of individuals who are registering products, attending events, joining newsletters, and so on. They also must be able to observe a given market segment for up-sell and cross-sell opportunities. For these reasons the Microsoft CRM and DSS data warehouse includes a flexible DSS environment.

Campaign management. Decision-makers also must be able to invite selected customers to company-sponsored events and contact them by bulk e-mail, surface mail, or telephone. For this reason, the data warehouse includes campaign-management capabilities.

Data suppression. While it is vital that Microsoft be able to contact most of the customers in its data warehouse on a regular basis and by whatever means are most practical, some customers prefer not to be contacted or to be contacted only through a certain means (e-mail but not phone, for example, or surface mail but not fax). Through data suppression, the Microsoft CRM and DSS data warehouse takes such considerations into account in determining which customers to contact and when and how to contact them. Data suppression is one component of the Microsoft effort to implement security and privacy preferences in the data warehouse. Not only does the warehouse indicate preferred contact and non-contact information for individual customers, it also limits access to sensitive data so that analysts can view counts without actually pulling names and addresses.
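As a minimal sketch of how suppression might be honored when pulling a campaign list, the following query excludes customers whose flags indicate they should not be contacted by e-mail. The table and column names are hypothetical, not the warehouse's actual schema.

    -- Select campaign candidates while honoring suppression flags.
    SELECT c.CustomerKey, c.EmailAlias
    FROM dbo.Customer AS c
    JOIN dbo.ContactPreference AS p ON p.CustomerKey = c.CustomerKey
    WHERE p.SuppressEmail = 0   -- customer accepts e-mail contact
      AND p.SuppressAll = 0     -- customer has not opted out entirely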

Order of Events

After defining the primary events in the data warehouse, design engineers determined the order in which these events must take place. As illustrated in Figure 2, the events are grouped according to data load, factory processing, and the generation of data marts.


Figure 2: The logical architecture of the Microsoft CRM and DSS data warehouse

Here is a brief summary of the data flow:

  • the data is securely stored in the ODS,

  • the data is moved to factory servers where aggregation and acceleration tables are built using SQL stored procedures, and

  • the transformed data is moved to a decision-support or campaign-management data mart where it is stored using a relational star-schema format.

Throughout the process, data is kept moving through the system by the application of business logic and rules, and the intermediate results are made available to the various servers that need the data for additional processing. This approach allows each of the servers in the data warehouse to have a specialized role and ensures that all the servers remain busy.

Physical Architecture

In designing the physical architecture of the Microsoft CRM and DSS data warehouse, engineers considered the logical architecture, the volume of input data, the complexity of business rules, and the availability of high-speed computers. To make the most efficient use of hardware and to be able to scale by adding servers, the engineers modeled the physical architecture after the logical architecture using a distributed computing environment.

Following the "Just in Time" methodologies that became popular in the early 1990s, this design enables an organization, such as Microsoft, to apply financial resources only when more processing is needed. An added advantage of this approach is that it enables an organization to take advantage of the most modern and most powerful computing equipment exactly when more computing power is needed. Figure 3 illustrates the way Microsoft scales the data warehouse following a Just in Time methodology.


Figure 3: Scaling the distributed data warehouse using a Just in Time methodology

The physical architecture also specifies how many servers are needed, what products are required to run on each server, and how networking and storage issues are to be addressed. For the Microsoft CRM and DSS data warehouse, the architecture specifies eight servers running SQL Server 2000 Enterprise Edition on the Windows 2000 Datacenter Server network operating system and two servers running the Windows 2000 Advanced Server network operating system for middle-tier reporting. SQL Server 2000 Enterprise Edition was required for the system to use more than 3 GB of memory. Figure 4 illustrates this arrangement, and Table 1 details the various hardware configurations used.


Figure 4: The physical architecture of the distributed data warehouse

Table 1 Hardware configurations used in the Microsoft CRM and DSS distributed data warehouse

Primary Server Role               Processors   Processor Speed   Memory   SQL Server Database Size
Standardization and Matching      4            500 MHz           4 GB     8 GB
ODS                               8            550 MHz           4 GB     900 GB
Campaign Management Factory       8            550 MHz           4 GB     1100 GB
Campaign Management Data Mart     8            550 MHz           4 GB     650 GB
Campaign Management Data Mart     8            550 MHz           4 GB     650 GB
Decision Support Middle Tier      4            500 MHz           4 GB     8 GB
Decision Support Factory          8            550 MHz           4 GB     600 GB
Campaign Management Middle Tier   4            500 MHz           4 GB     N/A
Decision Support Data Mart        8            550 MHz           4 GB     300 GB
Decision Support Data Mart        8            550 MHz           4 GB     300 GB

As of this writing, the warehouse uses more than 550 disks and 32 disk controllers for a combined total of 17.3 terabytes of unformatted physical storage. Seven servers use Microsoft Cluster Service, included in Windows 2000 Datacenter Server, and a Storage Area Network (SAN) to share almost 1 terabyte of SQL Server 2000–based data and make it globally available to other servers running SQL Server 2000. Figure 5 illustrates the logical architecture of the SAN, which uses Logical Unit Number (LUN) masking. The shared logical drives are managed using Microsoft Cluster Service.


Figure 5: Logical SAN architecture

Data Flow

To share the intermediate results of specialized processing with other servers, the data warehouse takes advantage of Microsoft Cluster Service, a SAN, and SQL Server 2000 backups. A SQL Server 2000 backup places a database on a clustered drive resource; Microsoft Cluster Service then applies "failover", transferring the drive resource to whichever server requests it. At that point the receiving server can immediately reload the database and resume its own specialized processing by executing more specialized SQL code.
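In outline, the hand-off might look like the following, assuming a clustered drive letter of S: and an invented database name; the paper does not publish the warehouse's actual drive assignments or database names.

    -- On the producing server: place the database on the clustered drive.
    BACKUP DATABASE ODSDerived
        TO DISK = 'S:\Backup\ODSDerived.bak'
        WITH INIT

    -- After Microsoft Cluster Service fails the drive resource over to
    -- the consuming server, that server reloads the database and
    -- resumes its own specialized processing.
    RESTORE DATABASE ODSDerived
        FROM DISK = 'S:\Backup\ODSDerived.bak'
        WITH REPLACE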

Servers in the distributed data warehouse are cabled to fiber-channel-controlled disk volumes. They are also cabled to an Asynchronous Transfer Mode (ATM) network. Initially, production support staff shared data between servers by copying the data over the network. This approach worked well when the warehouse contained only a few million customer records, but for larger data sets it did not scale well because the copy operation took days to complete. Load operations of the same volume of data now occur within minutes using the SAN, SQL Server 2000 backup, and Microsoft Cluster Service failover.

Note that although Microsoft Cluster Service is often deployed for improving fault tolerance and availability, in the distributed data warehouse it is not used for this purpose. Instead, it is used strictly for sharing SQL-based data among clustered nodes. Data is continually being loaded into the ODS. Each week the operations team stops the loading process and backs up the ODS to a SAN, using the SQL Server backup utility. Once the backup is complete, the team uses Microsoft Cluster Administrator to apply failover from the drive containing the backup to a decision-support factory server, where it is immediately reloaded.

The decision-support factory continues to process the data by applying business rules and data transformations using SQL. This process runs for roughly four more days. Then the factory server begins running its current factory processes while initiating a SQL Server backup of an ODS-derived database to a shared drive resource.

After the backup is completed, Microsoft Cluster Service is used to apply failover from the shared drive resource to the campaign-management factory server. Once this has occurred, the campaign-management factory server uses SQL Server 2000 to restore the database and begin processing the data to support campaign management. This processing occurs concurrently with processing being performed by the decision-support factory.

The campaign-management factory runs SQL stored procedures for roughly 12 days while applying business rules and data transformations to the ODS-derived database.

After the decision-support factory has finished executing SQL code, the factory initiates a transfer of its database to two different decision-support servers using the SQL Server backup utility and shared drive resources. Each decision-support server then restores the database contained on the shared drive resource resulting in data marts being available for campaign management and reporting purposes. Figure 6 illustrates this data flow.


Figure 6: The physical architecture and data flow of the Microsoft CRM and DSS distributed data warehouse

Integration

The distributed data warehouse integrates more than 30 different data feeds originating from both inside and outside the company. These data feeds represent many of the different ways that U.S.-based customers can initiate contact with Microsoft, including product registrations, newsletter subscriptions, event registrations, and changes of address. The data warehouse matches information from each of these sources to individual customers and accounts, and to individual names and addresses.

The warehouse also integrates with internal business systems, such as MS Sales and the World Wide Events system. MS Sales is a data warehouse containing information about distributor, reseller, and customer sales. By pulling data from MS Sales, the distributed data warehouse can provide information on how company-sponsored activities influence purchasing decisions. Integration with MS Sales is designed to help decision-makers determine how event attendance and newsletter subscriptions influence a customer's decision to make a second or third purchase. By pulling data feeds from the World Wide Events system, the data warehouse maintains up-to-date information on customer registrations at company-sponsored events.

Implementation

Implementing the data warehouse involved designing the logical and physical architectures, deploying a SAN, installing SQL Server 2000 and Windows 2000 Datacenter Server, and configuring Microsoft Cluster Service as well as software provided by third parties. Designers are implementing the distributed data warehouse in phases. They completed phase 1 in November 1999 and phase 2 in August 2000. As of this writing the implementation is now in phase 3.

Phase 1 consisted of deploying the physical and logical infrastructure of the warehouse and developing a user interface for decision support. The design team limited data sources to just three or four in this phase so as to limit the volume of data the warehouse would be required to deal with.

Also during phase 1, decision-makers began looking at the kinds of information the warehouse would make available and began providing feedback before the team moved on to the next phase of implementation. Roughly 30 subsidiaries are now regular users of features implemented during phase 1.

In phase 2 engineers significantly expanded the scale of the warehouse by adding data from nearly 30 more sources. They also implemented a more robust data processing and validation environment, a Trillium matching environment, and a data factory that would support campaign-management activities.

As of this writing, data in the CRM data mart is limited to customers residing in the United States. The DSS data mart contains data pertaining to customers residing in the United States as well as some worldwide data. Engineers used U.S.-based customers as the initial data set for the warehouse because it was the largest of all the Microsoft customer data sets. As such, the team considered it a useful basis for stress testing before they began incorporating data sets from other regions, which they are now doing as part of phase 3.

Implementing the data warehouse also involved custom-coding business rules and algorithms using SQL, which is one of the implementation areas that required a sustained and specialized effort to make the distributed warehouse a success. For this reason, the discussion will take the reader through implementation areas that required the warehouse design team to use SQL without including actual source code. (Note that this discussion assumes that the reader is already familiar with data-warehousing terminology, basic programming concepts, and the construction of relational databases.)

Data Loading

The process of turning granular data into information suitable for corporate business decision-making begins by loading data into the Operational Data Store (ODS). The ODS is a SQL Server 2000-based dimensional model, not an Entity Relationship Database (ERD), and is based on a star-schema format. The purpose of the ODS is to homogenize the input files for later reporting.

Data is loaded into the ODS, according to customer postal code, from more than 30 different sources. These include product-registration cards, event-registration cards, customer surveys, electronic product registrations, and additional data provided by Dun & Bradstreet, such as current mailing addresses and legal name changes. After loading, data from each source is mapped to common input fields, validated, matched with existing records in the warehouse, and then stored using edit precedence.

Up to 18 SQL load jobs run concurrently; this concurrency is required to sustain the number of records being loaded into the ODS each day.
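One way to launch load jobs concurrently is through SQL Server Agent, as in the hypothetical fragment below; each sp_start_job call returns immediately, leaving the jobs to run in parallel under the Agent. The job names are invented for illustration, and the paper does not state which job-scheduling mechanism the warehouse actually uses.

    -- Start several load jobs; each runs concurrently under SQL Server Agent.
    EXEC msdb.dbo.sp_start_job @job_name = 'Load - Product Registrations'
    EXEC msdb.dbo.sp_start_job @job_name = 'Load - Event Registrations'
    EXEC msdb.dbo.sp_start_job @job_name = 'Load - Newsletter Subscriptions'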

Data Validation

Data validation begins with a verification that each input file contains an appropriate number of columns and that each column is properly located in the file. A "Do Not Load" flag identifies columns that should be excluded from the load process.

All rejected records are made available for manual review. The syntax of each e-mail alias is validated before each record is loaded into the warehouse.

State and province codes contained in each address are validated by a comparison to a stored geographic taxonomy. Values that do not correspond with the taxonomy are preserved and loaded as free-text.

Free-text fields are compared against an "unacceptable values" table. This process is used to interrogate free-text data that is supplied by anyone using the Internet. The process also serves to validate and confirm that free-text is suitable for long-term storage and reporting.
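The checks described above might be expressed along the following lines. The staging and taxonomy table names, and the simplified e-mail rule, are illustrative only; the warehouse's actual validation rules are more thorough.

    -- Flag records whose e-mail alias fails a basic syntax test.
    UPDATE dbo.StagingInput
    SET RejectFlag = 1
    WHERE EmailAlias NOT LIKE '%_@_%._%'

    -- Preserve unrecognized state/province codes as free text.
    UPDATE s
    SET s.StateFreeText = s.StateCode, s.StateCode = NULL
    FROM dbo.StagingInput AS s
    LEFT JOIN dbo.GeoTaxonomy AS g ON g.StateCode = s.StateCode
    WHERE g.StateCode IS NULL

    -- Flag free-text values that match the "unacceptable values" table.
    UPDATE s
    SET s.RejectFlag = 1
    FROM dbo.StagingInput AS s
    JOIN dbo.UnacceptableValue AS u
        ON s.FreeTextField LIKE '%' + u.Pattern + '%'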

Standardization and Matching

When an existing customer registers a second or third product, the data warehouse uses standardization and matching to associate the customer with one already existing in the database. To support special cases, data-warehouse designers implemented their own match-coding algorithms in the ODS.

The ODS identifies records by country and standardizes them using third-party software. Wherever its use is feasible, the third-party software is used for address matching. In some cases, match-coding techniques within the ODS are used instead.

Names such as "Bill" and "William" are standardized so that when Bill and William are the same person, this is reflected in the warehouse. GEO-coding is used to associate postal codes to a given city, town, or district. GEO-coding capabilities are provided by third-party software.

Occasionally, address records are sent to the United States Postal Service to determine whether addresses have changed. If they have, the new addresses are used.

Data from the majority of sources is first run through processes that match the data and remove any duplicates. Next the data is passed to third-party software, which standardizes names and addresses such that if a zip code is missing one will be provided. Records containing repeat customer names and addresses are loaded into the warehouse after associating customer names with existing records in the warehouse. After the data has gone through the standardization and matching processes, it is stored in the ODS, using edit precedence.
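A greatly simplified illustration of a match-coding technique appears below: build a key from standardized name and address fragments, then use it to find candidate matches against existing customers. The real algorithms are more sophisticated, and all names here are invented.

    -- Build a crude match code from standardized fields.
    UPDATE dbo.StagingInput
    SET MatchCode = UPPER(LEFT(LastName, 4))
                  + UPPER(LEFT(FirstName, 1))
                  + LEFT(PostalCode, 5)

    -- Candidate matches share a match code with an existing customer.
    SELECT s.StagingKey, c.CustomerKey
    FROM dbo.StagingInput AS s
    JOIN dbo.Customer AS c ON c.MatchCode = s.MatchCode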

Standardization and matching provide a number of key benefits for the Microsoft CRM and DSS data warehouse:

  • They help to maintain the uniqueness of individual customer records.

  • They ensure that any additional data loaded into the data warehouse (e.g., subscriptions, registrations, event attendance) is matched to the correct customer.

  • They help to keep addresses and phone numbers accurate and up-to-date.

Edit Precedence

Edit precedence was developed using SQL and is used to statistically rank data sources at the field level, which is essential for increasing the accuracy of data in the warehouse over time. Every field in each data source receives a rank indicating the statistical accuracy of data contained in that field. Some data sources provide very accurate data, while others provide less accurate data. Each data source provides a few unique pieces of information and some information that is common to many data sources. All available fields from all available data sources are needed to build a comprehensive view of each customer.

Edit precedence solves the problem of determining which fields in which source or sources will provide the data capable of producing the most accurate view of each customer. For example, as data in each field is saved to the database, the data source is recorded with the field. By recording the source that provided data for each field, edit precedence determines when to replace data provided by less-accurate sources with that provided by more-accurate sources. This approach also enables the replacement, if necessary, of all the data provided by a given source.
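In rough outline, the replacement rule might be expressed as follows, assuming each stored field carries the rank of the source that supplied it. The schema is hypothetical; the actual edit-precedence code is not published in this paper.

    -- Replace a field only when the incoming source outranks the source
    -- that supplied the currently stored value.
    UPDATE c
    SET c.Phone = s.Phone,
        c.PhoneSourceRank = s.SourceRank
    FROM dbo.Customer AS c
    JOIN dbo.StagingInput AS s ON s.CustomerKey = c.CustomerKey
    WHERE s.SourceRank > c.PhoneSourceRank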

Data-warehouse designers use edit precedence to generate the statistics that help them catch data-entry errors, which are common among almost all sources, and to determine which data sources and fields are more or less accurate. One data source may provide an outdated address, for example, while another source may fail to reflect a recent name change. Accordingly, with edit precedence, warehouse designers can determine that it's best to use an address field from the second source and a name-change field from the first.

Table 2 provides an example of how edit precedence might be used. Fields 1, 2, 3, 4, and 5 are provided by sources 1, 3, 2, 3, and 1, respectively. In this example three sources are required to provide five fields of data. Fields 4 and 5 are initially provided by data sources having a low field ranking. Using edit precedence, the expected outcome is that the data in fields 4 and 5 will be replaced by data from sources having a higher field ranking, as defined by edit precedence.

Table 2 Example of edit precedence using three data sources

           Field 1   Field 2   Field 3   Field 4   Field 5
Source 1   3         1         1         0         1
Source 2   2         2         3         0         0
Source 3   1         3         2         1         0

Edit precedence provides a number of key benefits for the Microsoft CRM and DSS data warehouse:

  • It generates cost savings through the elimination of one or more data sources.

  • It results in field-level data becoming more reliable over time.

  • It simplifies the replacement of data from one source with data from another source.

Primary Record Types

The ODS contains two primary types of records: individual-based records and organization-based records. Individual-based records refer to individual customers who have purchased at least one product, attended at least one event, or subscribed to at least one online newsletter. Individual-based records, which are identified by residential addresses, do not ordinarily contain the names of organizations.

Individual-based records provide support for multiple e-mail aliases including the data source of each alias. A household key is used to bring together individual customers at the same physical location.

Organization-based records refer to businesses, companies, agencies, or divisions. An organization-based record is loaded into the warehouse provided the record is tied relationally to an individual-based record.

Telephone numbers are associated with individual-based records, but not organization-based records. Data suppliers periodically supply new telephone and fax numbers so that these are kept up to date. Telephone numbers are associated with existing fax numbers to ensure that a customer's preference is preserved even after a fax number changes. If necessary, a flag can be set to associate a telephone number with a company's home office. Suppression flags accommodate each customer's contact preferences. For instance, a flag may be set indicating that contact via fax should not be attempted.

Factory Processing

Factory processing, a term used within Microsoft, applies to warehouse implementations that distribute the ODS, the application of business rules, and data marts among multiple computers. Factory processing is a customized process developed by the warehouse design team using SQL and is specific to the distributed nature of the warehouse within Microsoft.

Factory processing is used to periodically extract data from the ODS and to perform transformations, aggregations, derivations, and partitioning before the enriched, denormalized data is published through the generation of new data marts. Using the data marts, decision-makers can drill down and analyze individual customer transactions, produce reports, and support campaign-management activities. Figure 7 illustrates the flow of data from the ODS through the factory process.


Figure 7: Factory-processing data flow
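As a sketch of the kind of transformation a factory step performs, the following query builds a denormalized, customer-level summary table of the sort a data mart might consume. The table and column names are invented; the actual business rules are specific to Microsoft and are not shown.

    -- Build a denormalized, pre-aggregated summary for the data marts.
    SELECT c.CustomerKey,
           c.City,
           c.StateCode,
           COUNT(DISTINCT r.ProductKey) AS ProductsRegistered,
           COUNT(DISTINCT e.EventKey) AS EventsAttended
    INTO dbo.CustomerActivitySummary
    FROM dbo.Customer AS c
    LEFT JOIN dbo.ProductRegistration AS r ON r.CustomerKey = c.CustomerKey
    LEFT JOIN dbo.EventAttendance AS e ON e.CustomerKey = c.CustomerKey
    GROUP BY c.CustomerKey, c.City, c.StateCode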

Factory processing provides a number of key benefits for the Microsoft CRM and DSS data warehouse:

  • It provides an effective way of scaling the warehouse beyond 1 terabyte.

  • It enables servers to perform more specialized processing.

  • It simplifies the implementation of additional servers when they are necessary.

Data Marts

Data marts are SQL Server 2000-based databases optimized for efficient reporting. Factory processing generates new data marts on a weekly basis. The data marts provide a robust decision-support and campaign-management environment.

Once the data marts have been generated, a combination of internally developed and third-party applications is used to pull data and display it in a graphical user interface. The graphical user interface permits decision-makers to gain visibility into the data and to conduct analyses. For example, one graphical user interface is used to pull data from four data marts, allowing decision-makers to "peel back" data layers to expose virtually any target audience. Doing this enables decision-makers to better understand which products, newsletters, and events are preferred by various target audiences.

Warehouse designers implemented four perspectives in the data marts so as to ease access to information for decision-makers. These perspectives are customer data, product registration, online subscription, and customer events.

Customer-Data Perspective

The customer-data perspective provides decision-makers with a comprehensive view of individual customers and organizations according to geography, profile, interests, and account activity. This perspective accommodates complex queries involving cross-activity analysis including profiles of individuals and organizations. The perspective counts individuals and organizations depending on which virtual perspective is being used.

Two virtual perspectives are based on the customer-data perspective: individual and organization. These virtual perspectives are similar to the base customer-data perspective, except that they can produce reports based on a specified number of individuals and organizations, which the base perspective cannot.

Figure 8 illustrates a small segment of a fact table and some related dimensions in a data-mart customer-data perspective.


Figure 8: Example of data-mart information available to corporate decision-makers

Product-Registration Perspective

The product-registration perspective stores product-registration transactions in a time-based view. The principal use of this perspective is to establish product-registration trends by individual customers over time. The default query fields are fiscal month, subsidiary, sales location, product name, and count of product registrations.
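A query along these default fields might resemble the following star-schema join; the fact and dimension table names are illustrative, not the data mart's actual schema.

    -- Count product registrations by fiscal month and product name.
    SELECT t.FiscalMonth,
           p.ProductName,
           COUNT(*) AS RegistrationCount
    FROM dbo.FactProductRegistration AS f
    JOIN dbo.DimTime AS t ON t.TimeKey = f.TimeKey
    JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
    GROUP BY t.FiscalMonth, p.ProductName
    ORDER BY t.FiscalMonth, p.ProductName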

Online-Subscription Perspective

The online-subscription perspective stores online-subscription transactions in a time-based view. The principal use of this perspective is to establish online-subscription trends by individual customers over time. For newsletter-related activity, the perspective includes all subscribe and unsubscribe activity, including multiple subscribe or unsubscribe actions by a single individual.

Customer-Events Perspective

The customer-events perspective stores information about customer-event activities in a time-based view. The principal use of this perspective is to view customer activities connected with specific events over time. The perspective counts individual invitations, registrations, confirmations, and event attendances.

Support

The following section of this paper details some of the more important support tools used by the data-warehouse team for monitoring and maintaining the system.

These tools help to collect some information proactively, so that if a problem does arise the information can help to narrow down its potential causes; they collect other information at the time a problem arises. The primary support tools consist of Windows 2000 Performance Monitor, Windows 2000 Event Viewer, SQL Server-based e-mail, Windows 2000 Terminal Services, and SQL code.

Windows 2000 Performance Monitor

Windows 2000 Performance Monitor, which is included in Windows 2000 Professional, Windows 2000 Advanced Server, and Windows 2000 Datacenter Server, is used for locating performance bottlenecks caused by system or software components. Output from Performance Monitor is displayed graphically and supports the concurrent viewing of multiple system components and application processes. Performance Monitor is the first tool used by the operations team to determine whether a process or application is allocating too much memory or using too much processor time.

Data provided by Performance Monitor is extremely valuable to analysts and operations teams when it is time to tune operating-system performance. At Microsoft, only server administrators may run Performance Monitor, because experience has shown that information can readily be captured from a single machine and shared among analysts when necessary. Experience has also shown that performance is slightly improved by limiting the number of computers that are used for simultaneously viewing performance data. At Microsoft this number rarely exceeds five.

There are many reasons for initiating performance analysis using Performance Monitor, chief among them when an important process begins requiring far more time to complete than it did in the past. The decision to initiate a performance analysis comes from a group that is responsible for collecting performance data on a few computers dedicated to that purpose. Data is collected more or less following the scientific method, with collection taking place while suspect processes are run. Administrators add all performance counters (except for network segment) and log the results to disk for a period of 48 to 72 hours. Then they pass the log file to a support analyst who analyzes the performance data.

Performance Monitor is also used for systematically overlaying performance counters with memory, CPU, and disk counters, which are needed to identify which process or processes may be causing a given performance problem. Depending on the outcome of the analysis, performance problems can generally be resolved by adding more hardware, adjusting the timing of SQL jobs, or moving a resource-intensive process to another server.

Event Viewer

The Event Viewer displays information based on event messages that SQL Server 2000 and Windows 2000 Datacenter Server write to application and event logs located on each server. Each event message has a corresponding ID that can be used to help determine the nature and severity of many common problems, such as full disks, failed hardware devices, or processes that have run out of memory. Through the Event Viewer, database administrators (DBAs) can use the event ID for searching through the Microsoft Knowledge Base (https://support.microsoft.com/default.aspx?scid=FH;EN-US;KBHOWTO&sd=GN&ln=EN-US?ln=en-us) for more detail on the potential cause of a given problem.

SQL Server-Based E-mail

The operations team relies on SQL Server-based e-mail to keep track of SQL Server 2000 activities in the Microsoft CRM and DSS data warehouse. SQL Server-based e-mail is configured to periodically send notifications to the production team so that they have insight into what is going on inside the system.

SQL Server-based e-mail is also used to automatically page someone on the production support team if an issue warrants immediate intervention. The data warehouse initiates SQL Server-based e-mail by programmatically calling the extended SQL stored procedure XP_SendMail from code locations that warrant e-mailing the production team.
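A call of this kind might look like the following; the recipient address and message text are invented, but xp_sendmail itself is the SQL Server 2000 extended stored procedure named above.

    -- Notify the production team from within a warehouse process.
    EXEC master.dbo.xp_sendmail
        @recipients = 'dwproduction@example.com',
        @subject = 'ODS load complete',
        @message = 'Nightly load finished; see the process log for details.'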

Windows 2000 Terminal Services

Members of the operations team use Windows 2000 Terminal Services, which is part of Windows 2000 Datacenter Server, for remote operations. With Terminal Services, operations personnel at any location can install, configure, and troubleshoot SQL Server 2000 as if they were working at the system console.

SQL

Many of the processes used by the ODS, factory, and data-mart servers were developed using SQL. Within each of these processes, the design team implemented diagnostic code that provides a continuous trail of information to help warehouse operations staff determine what is happening inside the warehouse. Information such as process identification numbers and the time required for completion of each stored procedure is continually written to SQL-based tables for reporting purposes. Information contained in the tables can be viewed at any time to determine how well the system is functioning from a SQL perspective. Recorded information enables developers to observe performance over a period of time and plan for additional capacity.
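Such instrumentation might be sketched as follows, with invented table and procedure names; each stored procedure records its own identity and timings to a logging table as it completes.

    -- Record the process ID and elapsed time of each procedure run.
    CREATE PROCEDURE dbo.BuildAccelerationTables AS
        DECLARE @start datetime
        SET @start = GETDATE()

        -- ... the procedure's real work would happen here ...

        INSERT INTO dbo.ProcessLog (ProcessName, SpId, StartTime, EndTime)
        VALUES ('BuildAccelerationTables', @@SPID, @start, GETDATE())
    GO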

As a best practice, messages generated in SQL processes are logged to files, because many of the jobs run for a long time. These jobs are continually generating informational messages, and experience has shown that logging such messages to files helps to ensure that the display length will not be exceeded.

Supporting Clustered Resources

Understanding how to support clustered resources is vitally important in a system such as the distributed data warehouse, where timing is key to the flow of data. In the warehouse, seven servers are clustered for the purpose of sharing disk resources. Each of the seven servers has its own local storage as well as global storage created by Microsoft Cluster Service and a SAN. If a server needs to read data from global storage, the server requests a clustered disk resource containing the data. If a server needs to write data to global storage, the server will request a clustered disk resource to which it will write the data.

Warehouse designers implemented the warehouse so that requests from SQL Server 2000 for shared disk resources use SQL. As a server executes SQL code, a statement in the code will eventually be executed that instructs Microsoft Cluster Service to assign a disk resource to the server.

This was not the warehouse designers' original implementation. At first, they attempted to assign disk resources by calling cluster.exe from multiple servers. But this method did not work well, because it meant that shared drive resources were taken away from servers while SQL Server was either reading data from or writing data to them. So the production support team stepped in and overcame the problem by developing a shared drive interface using SQL statements. The drive interface serves as a locking mechanism so that shared disk resources are not taken away from servers that are using them. Servers that require a shared drive resource call the drive interface code and then go into a SQL wait loop until failover is applied to the drives.

The drive interface code keeps track of the shared disk resources by recording that information in a table. The drive interface code is the only location from which cluster.exe is called. All servers that use shared disk resources in the warehouse do so through this common interface.
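The request-and-wait pattern might be sketched as follows. The interface procedure, status table, and polling interval are all illustrative; the real interface is the only code path that calls cluster.exe, as noted above.

    -- Request the shared drive through the common interface, then poll
    -- until Microsoft Cluster Service has applied failover.
    DECLARE @server sysname
    SET @server = @@SERVERNAME
    EXEC dbo.RequestSharedDrive @DriveName = 'S:', @Requester = @server

    DECLARE @owned int
    SET @owned = 0
    WHILE @owned = 0
    BEGIN
        WAITFOR DELAY '00:00:30'   -- SQL wait loop; poll every 30 seconds
        SELECT @owned = OwnedByRequester
        FROM dbo.SharedDriveStatus
        WHERE DriveName = 'S:'
    END
    -- The drive resource now belongs to this server; restore and proceed.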

Security

Strict security is enforced to ensure the confidentiality of customer data within the CRM and DSS data warehouse. Security includes, but is not limited to, securing the physical hardware and the computing, network, and software components of the data warehouse. Processes in place to safeguard the confidentiality of customer data are also strictly enforced.

Access to the physical hardware is restricted by server placement and by limiting the number of trusted personnel who have appropriate card-key access and training. The equipment is under video surveillance 24 hours a day, 7 days a week. Network access is restricted by isolation from the Internet and by requiring Active Directory™ service authentication before a user can log on to the network. Each server in the warehouse has its file system under control of an Access Control List (ACL) to ensure further authentication by Active Directory before access to the file system can occur.

SQL Server 2000 Enterprise Edition secures customer data in each database by requiring internal CRM warehouse users to obtain a special account and by authenticating these accounts with Active Directory.

By the design of the CRM warehouse, decision-makers gain access to customer data through data marts. Access to sensitive data within the data marts is limited according to the job function of each decision-maker. For example, decision-makers who are not required to run campaigns have limited access so that they can perform analyses without actually pulling lists containing customer names and addresses.

Building the Team

To plan and implement the Microsoft CRM and DSS distributed data warehouse, the company formed a team of peers based on their knowledge and skills in six separate areas: storage engineering, product management, program management, design, testing, and production support. Figure 9 illustrates the team of peers used to design, implement, and deploy the distributed CRM and DSS warehouse.

Figure 9: Team of peers used to deploy the warehouse

Storage Engineering

Requirements. The project needed experts in storage engineering who could design and implement a large SAN. Such professionals would deploy back-end equipment enabling designers to scale the warehouse to enormous proportions. Storage engineers also would serve to advise other team members on effective methods of supporting the environment using Microsoft Cluster Service.

Personnel. Two storage engineers joined the team and brought with them expertise in Logical Unit Number (LUN) masking, Storage Area Networking, hardware benchmarking, and Windows 2000 Datacenter Server.

Product Management

Requirements. The project needed individuals who could thoroughly understand business requirements and help to ensure that the warehouse would later satisfy company-wide objectives. Such individuals would need to be able to meet with corporate decision-makers and capture what they considered vital information requirements.

In seeking individuals to fill the role of product manager, leaders looked for the ability to specify complex business process flows, determine data sources and technology architectures, and generate and deliver technical presentations. They also looked for experience in Microsoft products including SQL Server 2000, Windows 2000 Advanced Server, and Microsoft Office.

Personnel. Several individuals who met these qualifications joined the team as product managers and helped to drive the big picture. These results-oriented technical professionals became invigorated by identifying product features and functionality and by gathering and analyzing business requirements and product strategies. They also delegated assignments, proactively incorporated requirements into the product-planning process, sought user feedback, and drove warehouse design.

Program Management

Requirements. The project needed individuals who could carry out work according to a common schedule and project plans containing hundreds of different steps as required by the complex design and deployment of the warehouse. Such individuals would need to keep track of all project dependencies and to translate business requirements into functional specifications, process flows, and data models.

In seeking individuals to fill the role of program manager, project leaders looked for five years of proven management experience in an information-technology setting, work in technology support, and the ability to perform hands-on and managerial roles simultaneously. They also sought people with knowledge of client/server computing, excellent written and verbal communications skills, and familiarity with Microsoft products including SQL Server and Windows 2000 Datacenter Server.

Personnel. Several individuals who met the qualifications joined the team as program managers and helped to drive the development timeline, analyze business requirements, develop account-management strategies, and manage multidisciplinary teams toward a common objective. These people helped to drive the big picture by partnering with internal groups to design, develop, and implement the data warehouse. They also contributed by communicating business problems, alternatives, and project strategies to various work groups while managing relationships between team members and modeling strong team building.

Design

Requirements. In seeking individuals who could handle the design aspects of the data warehouse, project leaders looked for expertise in relational star schemas, acceleration tables, and aggregation and skills in implementing highly efficient code using SQL.

Personnel. The warehouse designers are by far the most technical individuals on the team. They are skilled at developing medium- to high-risk business solutions under deadline pressure while implementing specifications that result in easily maintainable code. They are responsible for successful completion of projects in conformance to project goals and requirements and are able to defend design requirements, feature sets, and functionality to other team members.

These individuals also possess strong written and verbal communications skills, a demonstrated technical ability, in-depth knowledge of software-development processes, and the ability to work closely with program managers, testers, and production support personnel. In addition they have worked extensively with Microsoft SQL Server, relational and logical database design, coding SQL, and performance tuning and are knowledgeable in COM+/DNA, OLE, XML/XSL, MTS, Visual Basic® Scripting Edition (VBScript), JScript® development software, Visual Basic, DHTML, ASP, ADO as well as Internet Information Service (the Web server in Windows 2000 Server).

Testing

Requirements. To ensure that each functional capability would not introduce any instability into the data warehouse, project leaders needed testing personnel sufficient to test each new component as it was developed. Testers would need to be well-versed in nearly every aspect of the warehouse and bring to the project a "zero defect" mindset. They would need to be prepared to analyze and duplicate, in a very controlled manner, any problem that might affect the stability of the warehouse so the warehouse design team could quickly resolve the cause of the problem.

Personnel. Testers work closely with designers to prevent problems and to make the product more testable. They define and execute test cases by analyzing specifications and writing automated scripts for testing warehouse, user interface, and database functionality. Testers also possess demonstrated testing skills, an in-depth understanding of software development and SQL Server, and the ability to analyze coding changes made in Active Server Pages and SQL.

Production Support

Requirements. The project needed people responsible for ongoing day-to-day production support to ensure that simple problems could be resolved before they might adversely affect the overall data warehouse. Other essential skills were an ability to troubleshoot, to resolve problems quickly on the basis of limited information, and to develop monitoring techniques specific to data-warehouse support.

Personnel. Individuals who provide production support on the data warehouse are widely familiar with Microsoft products including Windows 2000 Datacenter Server; Windows 2000 Advanced Server and its Web server, Internet Information Service; and SQL Server 2000. They are typically able to develop command-line scripts in VBScript, JScript, and/or Windows NT® Batch. They also understand SQL optimization and coding techniques; are familiar with C++, networking fundamentals, and system tuning and timing; and are considered expert troubleshooters.

Production-support people also work closely with network engineers, storage engineers, developers, testers, program managers, and operations teams including database administrators. They possess managerial as well as problem-solving skills and a demonstrated ability to learn new technologies quickly.

Lessons Learned

The team responsible for design, deployment, and support of the Microsoft CRM and DSS data warehouse encountered a number of challenges before they were ready to consider the warehouse a success. Here are the primary lessons they learned:

Few records actually need archival. Initially, designers believed that data archival would be an important aspect of the warehouse design. Later they learned that only a small percentage of records actually needed archival and that the best way to deal with those records was through data suppression.

Unicode should have been better used. Designers now believe that they should have made better use of Unicode in their original design. They believe that had they implemented Unicode support extensively throughout the design, they would now find it easier to support multiple code pages.

Copying databases between computers did not scale. Initially, the production-support team copied SQL Server databases between computers so that several computers running SQL Server could share some of the data. But they subsequently learned that copying the databases across the network took too long and that using a SAN would be a better approach to sharing data between computers. With a SAN, data is written directly to disk by going through fiber-channel and hardware backplane at speeds of up to 100 megabytes per second. Moving data between servers using a SAN reduced the time required to transfer data between servers from more than three days to just a few hours.

Timing is vital to keep the data moving. Early in the implementation, the production-support team worried that contention for clustered disks might cause problems when attempting to share data using a shared disk resource. To overcome this concern, team members developed a SQL process to track and manage the assignment of shared disk resources by providing a common interface to those resources. With the help of this process, a server requiring a shared disk resource is provided that resource simply by calling the common interface. Servers that require the shared volume submit a request to the custom application and then go into a SQL wait loop until failover has been applied to the drives. The common interface prevents a server in the warehouse from taking a shared drive array away from another server while it is writing data to the drive.

Storage engineering is a requirement. A number of years ago, ITG did not have its own dedicated group of storage engineers. Such people are now required and regularly called upon to deploy SAN-based solutions to house the ever-growing information requirements of the company.

Conclusion

Decision-makers from small, medium, and large companies alike depend on having accurate and up-to-the-minute information in a format that enables them to retain or regain business agility. They must have the information they need quickly to favorably tune their business processes in response to ever-changing market conditions, to understand customer requirements, and to manage and nurture valuable customer relationships.

Microsoft has designed and deployed a distributed data warehouse toward the goal of gaining a more comprehensive view of its customers and thereby improving customer satisfaction and enhancing customer loyalty. By presenting the story of the Microsoft CRM and DSS data-warehouse project in this document, ITG hopes its own customers with similar goals can gain insight about how they might do the same for their organizations.

For More Information

More information on this topic and related topics is available from the following sources:

For the latest information on Microsoft Windows 2000 Advanced Server and Windows 2000 Professional go to https://www.microsoft.com/windows

For the latest information on Microsoft SQL Server 2000 go to https://www.microsoft.com/sql/

For the latest information on Microsoft Windows 2000 Datacenter Server go to https://support.microsoft.com/kb/265173/

For other information that illustrates additional solutions go to https://www.microsoft.com/business/default.mspx

To view additional IT Showcase material, go to https://www.microsoft.com/technet/itsolutions/msit/default.mspx

For questions, comments, or suggestions related to this document or to obtain additional information about Microsoft IT Showcase, please e-mail showcase@microsoft.com.

1 The term "customer" represents an end user who has purchased at least one Microsoft product from a Microsoft distributor or reseller, registered for event attendance, or subscribed to an online newsletter. Microsoft's channel relationship requires the company to acquire the name and address of each end user from discrete sources.

2 Windows 2000 Datacenter Server supports clusters up to four nodes. (Windows 2000 Datacenter Server is sold through qualified Windows 2000 Datacenter Server OEMs. Please refer to https://support.microsoft.com/kb/265173/ for a list of qualified OEMs.) As part of continuing product development, warehouse designers and product development groups within Microsoft are evaluating Windows 2000 Datacenter Server in configurations that use more than four nodes. This document illustrates the use of a seven-node cluster that is under evaluation.