Export (0) Print
Expand All

Enterprise Scale Clickstream Analysis Running on SQL Server 2008 R2 Parallel Data Warehouse

Business Case Study

Published: November 2010

When Microsoft wanted to add clickstream data to the company’s enterprise data warehouse (EDW), it found that only about one-sixteenth of the information could be loaded into the EDW each day. To provide a richer data set, the Microsoft Information Technology (Microsoft IT) group deployed a clickstream data warehouse by using Microsoft® SQL Server® 2008 R2 Parallel Data Warehouse, which supports multiple instances of SQL Server running on separate compute nodes. The company can now load complete clickstream data and has found that queries are processed 30 times faster.

Download

Download Business Case Study, Microsoft Word file, 370 KB

Customer Profile

Situation

Solution

Benefits

Based in Redmond, Washington, Microsoft is the worldwide leader in software, services, and Internet technologies for personal and business computing.

Microsoft needed the ability to load 225 GB of compressed clickstream data per day and support queries across some 7 billion rows of data.

The company deployed its clickstream data warehouse by using Microsoft SQL Server 2008 R2 Parallel Data Warehouse running on an appliance with eight compute nodes. Each node has its own processors, memory, storage, and dedicated instance of SQL Server.

  • Faster data loading and transformation
  • 30 times faster query processing
  • Scalability
  • Easier database administration

Situation

Clickstreams—the capture of mouse clicks and keystrokes as users visit a website or online application—can yield valuable business intelligence (BI) that can help an organization to better understand, and more effectively serve, the needs of visitors. But for global organizations like Microsoft, the volume of clickstream data generated in a single day across all of its sites can be overwhelming.

Microsoft has long appreciated the value of using an Enterprise Data Warehouse (EDW) to provide a central repository for information from its operations, which span 89 countries, include some 80,000 employees, and generate annual revenue of more than $62 billion US. The company’s EDW, which hosts data from dozens of internal sources and feeds more than 500 subscribing internal applications, had grown to over 10 terabytes.

The EDW, hosted on the Microsoft Application Platform (including Microsoft SQL Server 2008 R2 Enterprise data management software running on the Windows Server® 2008 R2 Enterprise operating system), provided a robust repository. However, the quantity of clickstream data that the company’s marketing group wanted to add to the EDW challenged Microsoft IT.

Part of the challenge was the complexity of the extract, transform, and load (ETL) processes required to properly sequence clickstream data so that its full value could be gained when analyzed from the EDW. The company’s marketing analysts wanted to be able to do the following:

  • See how deeply users browse into a site

  • Determine where problems appear

  • Analyze clickstream data against marketing data to gauge the effectiveness of programs, promotions, and other elements

By using SQL Server 2008 R2 Integration Services, Microsoft IT created a 15-step ETL process to maximize the value of the clickstream data imported into the EDW. But the volume of clickstream data from Microsoft global operations was overwhelming: Storing complete clickstream data would amount to some 10 terabytes, doubling the size of the company’s EDW. Although raw storage was not a problem, timely ETL processing of so much data was.

Even with compressing 500 gigabytes (GB) of data down to 225 GB, there were not enough hours to perform ETL on all of the clickstream information captured each day. To reduce load times, Microsoft IT imported a sampling of clickstream data, representing about one-sixteenth of the total information captured each day.

“The parallel processing is so efficient that instead of using classic ETL, we used ELT. We load the data before transforming it. We no longer do any pre-processing because it is faster to do everything within the Parallel Data Warehouse.”

Ron Ortloff
Principal Architect
Microsoft Corporation

The problem was not with SQL Server or with Integration Services, but with the inherent scaling limitations of a symmetric multiprocessing (SMP) architecture in which multiple processors access the same shared memory. Microsoft IT wanted to meet the desire of the company’s marketing teams to store complete clickstream data and be able to efficiently query across 90 days of the information—with the largest fact table containing approximately 7 billion rows of data.

To scale to meet the daily ETL loads of full clickstream capture, and to support efficient queries across billions of rows of clickstream data, Microsoft looked at parallel processing that would enable tables with billions of rows to be deployed, loaded and queried efficiently, overcoming the limits of SMP architectureSolution

Solution

The company found the solution that it needed with SQL Server 2008 R2 Parallel Data Warehouse, which uses massively parallel processing (MPP) to deliver high performance and scalability. Deployment and ongoing maintenance are simplified because SQL Server 2008 R2 Parallel Data Warehouse is delivered as an appliance with carefully balanced, pre-assembled hardware and software from leading hardware vendors.

The EDW clickstream solution, currently in pilot phase with over 10 terabytes of data loaded is deployed on a Parallel Data Warehouse appliance with eight compute nodes. Each node is equipped with a dual-core processor, 96 GB of random access memory (RAM), its own dedicated storage, and its own instance of SQL Server.

In traditional SMP architecture, query processing occurs entirely within one physical instance of the database. CPU, memory, and storage impose physical limits upon speed and scale. In contrast, the MPP architecture of SQL Server 2008 R2 Parallel Data Warehouse distributes partitions large tables across multiple physical nodes. Because each node has its dedicated CPU, memory, storage, and instance of SQL Server, it is called a parallel shared-nothing architecture, where each node can function without the constraints of shared resources.

Architectural Notes

SQL Server 2008 R2 Parallel Data Warehouse appliances are delivered with a minimum of two racks: a control rack and a data rack, as shown in Figure 1. Additional data racks can be added to distribute load and increase capacity. The current architecture supports up to 40 compute nodes—either four or five data racks, depending on the vendor.

The control rack includes six servers: two management nodes, two control nodes, the landing zone, and the backup node. Additionally, the control rack ships with dual InfiniBand, Ethernet, and Fibre Channel switches.

Figure 1. Each Parallel Data Warehouse appliance includes a control rack and a data rack

Figure 1. Each Parallel Data Warehouse appliance includes a control rack and a data rack

The functionality of the control rack nodes includes the following:

  • Management node: The management node includes its own instance of Active Directory® Domain Services (AD DS) to facilitate out-of-box functionality of the appliance. The management node also includes the Cluster service for high availability, and high-performance computing, which is used during the initial installation and for ease of management of the nodes within the appliance.

  • Control node: The control node is where user requests for data enter and exit the appliance. On the control nodes, queries are parsed and then sent to compute nodes for processing. However, queries do not always use the control node to stream data. Aggregation compatible queries can stream rows directly from the nodes which eliminates the control node bottleneck.

  • Landing zone: The landing zone is essentially a large file server with SAN storage to provide a staging ground for loading data into the appliance, either through the command line with DWLoader (a program for quickly loading data warehouse from a remote data source) or through use of Integration Services, which has a Parallel Data Warehouse connector.

  • Backup node: Compute nodes can copy data to the backup node in parallel via the high-speed InfiniBand that connects the nodes. From the backup node, organizations can offload their backups through their normal procedures.

The data rack holds the compute nodes and is where all user data is stored and processed during query execution. Each Compute Node is connected to dedicated, independent SAN storage. Each data rack has between 8–10 compute nodes and uses failover clustering to gain high availability. To facilitate failover clustering, a spare node within the rack acts as a passive node in the cluster. Each SQL Server 2008 R2 Parallel Data Warehouse appliance comes with at least one data rack. Additional racks can be added to facilitate scalability. The data rack for the Microsoft IT clickstream EDW deployment has capacity to store 32 TB of EDW's SQL Server data after compression; based on 12.8 TB of RAID configured SAN data capacity on and EDW's compression ratio of 40%. 

Data Loading

The Microsoft IT EDW deployment holds information from more than 50 sources. As shown in Figure 2, the clickstream Data Warehouse combines commercial data (financial, expense, customer relationship management [CRM], and other sources) from the EDW with the clickstream data from Microsoft global websites. Initially, Microsoft IT followed a traditional ETL approach, pre-processing some of the clickstream data before moving it onto the Parallel Data Warehouse appliance. However, the company found that the MPP appliance was so fast that a more efficient approach was to load all of the data onto Parallel Data Warehouse before performing the 15-step ETL process.

 Figure 2. High-level financial data from the EDW is combined with clickstream data and loaded into Parallel Data Warehouse 

Figure 2. High-level financial data from the EDW is combined with clickstream data and loaded into Parallel Data Warehouse

Parallel Data Warehouse feeds a multidimensional cube, hosted on a separate server and created through SQL Server 2008 R2 Analysis Services. The clickstream data mart uses a relational online analytical processing (ROLAP) design to support scalability.

The Data Compression feature of SQL Server 2008 R2 is used throughout the clickstream EDW. It provides a 40 percent compression ratio for the clickstream data and reduces the 300-GB daily feeds to 125 GB. The combination of SQL Server 2008 R2 Parallel Data Warehouse running on the eight-node Dell Parallel Data Warehouse appliance has enabled Microsoft to perform ETL 30 times faster. Microsoft can also now load complete clickstream data, rather than the one-sixteenth slice that it previously used. The MPP computing power of the Parallel Data Warehouse solution has enabled the company to run queries—across the 7 billion rows of data—that would not have been possible with the earlier SMP solution.

“The analysts on the marketing teams can run queries that just weren’t possible when we were running on just a single instance of the database.”

Sailaja Tallamraju
Senior Software Development Engineer
Microsoft Corporation

Benefits

“When analyzing page clicks across 30 million-plus Web pages, the serial processing of an SMP solution was a huge barrier. Queries that used to run in 15 minutes—against much smaller data sets—are now completed within 30 seconds.”

Sailaja Tallamraju Principal Senior Software Development Engineer
Microsoft Corporation

SQL Server 2008 R2 Parallel Data Warehouse provided the foundation needed to create a full clickstream EDW. The company has enjoyed a number of benefits from the deployment, including faster data loading and transformation, 30 times faster query processing, scalability to grow to meet customer needs, and easier database administration.

Faster Data Loading and Transformation

Creating the Clickstream Data Warehouse by using SQL Server 2008 Parallel Data Warehouse enabled Microsoft IT to expand its information base from a 7-day repository holding a one-sixteenth sampling of daily click information to a 90-day repository of complete clickstream data.

Loading complete data was not possible before the Parallel Data Warehouse solution. Experiments that Microsoft IT performed with a single-instance SMP platform showed that processing one day's clickstream data would take more than 24 hours.

The Parallel Data Warehouse appliance, with eight processing nodes, is so efficient at performing ELT on clickstream data that Microsoft IT recommends as a best practice that organizations consider eliminating pre-staging. Rather than using the traditional model of transforming the data before loading it, Microsoft IT suggests organizations load and then transform the data to more fully take advantage of the MPP efficiency of SQL Server Parallel Data Warehouse.

The ability to load and transform a complete set of clickstream data each day, from more than 30 million webpages, gives marketing groups the ability to perform much deeper analytics. The solution also helps analysts trust their findings because they are dealing with a complete data set and not just a random sampling.

30 Times Faster Query Processing

The processing power that comes from spreading queries seamlessly across eight computing nodes—each with its own memory, storage, and instance of SQL Server—has increased query performance to the point where Microsoft can perform queries that previously would not have been possible or practical across billions of rows of data.

Despite having far more data to query against—90 days of complete data compared to 7 days of partial data—Microsoft IT found that the clickstream data warehouse, running eight instances of SQL Server, was able to process queries more than 30 times faster than on the earlier single-instance deployment. Queries that required 15 minutes to run with the old system were completed within 30 seconds on the new clickstream EDW running on SQL Server 2008 R2 Parallel Data Warehouse.

The ability to run faster queries across vastly more data is enabling Microsoft analysts to look at clickstream data in new ways, enhancing the BI that they can derive from the information.

Scalability

Although the clickstream EDW now stores more than 10 terabytes of clickstream data, Microsoft IT is confident in the ability to scale to meet future needs by adding data racks to its parallel data warehouse appliance. The clickstream data warehouse is currently hosted on 8 compute nodes, and SQL Server Parallel Data Warehouse supports up to 40 nodes.

Easier Database Administration

“Our PDW appliance is configured with eight compute nodes, but SQL Server Parallel Data Warehouse supports up to 40 nodes. We assume there will be even larger appliance sizes in the future.”

Ron Ortloff
Principal Architect
Microsoft Corporation

The fact that SQL Server 2008 R2 Parallel Data Warehouse is delivered as part of a carefully balanced appliance simplifies deployment and maintenance. Parallel Data Warehouse includes an administrative console that provides an overview of what is happening across the system. Administrators can look for long-running queries, for example, or check log information, and at any point click for more detailed information.

The Parallel Data Warehouse administrative console provides an intuitive interface and dashboard for identifying and resolving issues. Database administration is simplified because the automated storage management replaces complex space allocation, and diagnostics detect and broadcast hardware issues.

The greatest administrative benefit of Parallel Data Warehouse, though, is what does not have to be managed. Microsoft IT was impressed by how the Parallel Data Warehouse transparently handles breaking large queries up into pieces that can be processed on individual compute nodes, and also transparently manages the process of distributing data across all of the nodes within the appliance—without requiring any administrative action.

For More Information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:

http://www.microsoft.com

http://www.microsoft.com/technet/itshowcase

© 2010 Microsoft Corporation. All rights reserved.

Microsoft, Active Directory, SQL Server, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.

Was this page helpful?
(1500 characters remaining)
Thank you for your feedback
Show:
© 2014 Microsoft