Export (0) Print
Expand All

Hosting 90 Terabytes by Using SQL Server 2008 R2 Parallel Data Warehouse

Article

Published: November 2010

When the Microsoft network security team needed to scale its data store to 90 terabytes, while increasing already strained data loading beyond 1 terabyte a day and improving query performance, it turned to Microsoft® SQL Server® 2008 R2 Parallel Data Warehouse. Parallel Data Warehouse enables large tables to be spread across multiple physical nodes. Each node has dedicated CPU, memory, and storage, and each runs its own instance of SQL Server. The resulting computing power enables the team to process up to 4 terabytes of new information per day into its data warehouse. The team has seen a twentyfold improvement in query performance as a result.

Download

Download Article , 207 KB, Microsoft Word file

Download Video

Intended Audience

Products & Technologies

IT business managers, IT technical managers, SQL Server administrators, IT security specialists

  • Windows Server 2008 R2 Enterprise
  • Microsoft SQL Server 2008 R2 Parallel Data Warehouse
  • Microsoft SQL Server 2008 R2 Integration Services

Introduction

Microsoft is similar to other global enterprises in its need to safeguard a large base of IT resources. With some 96,000 employees operating in more than 100 countries, Microsoft must safeguard more than 250,000 computers worldwide. A key element of the safeguarding effort is performed by the company’s network security team, which analyzes network utilization events captured from various sources, including more than 100 proxy servers, mail servers, Net Logon servers, and related devices throughout the company’s worldwide operations.

The network security team uses an internally developed monitoring solution that records specific forensic events to provide post-incident tracking and research tools for investigation and evidence collection. This application, Information Security Consolidated Event Management (ICE) 4.0, was holding up to 60 terabytes of information in a data warehouse to support reporting and querying.

Analysis of the proxy data empowers the team to identify and remediate numerous security issues that would have gone undetected otherwise. But with the ICE solution capturing approximately 1 terabyte of log data each day, the data loading process, which involves performing extract, transform, and load (ETL) processes on flat file data, sometimes overwhelmed the system, causing loading delays.

Adding to the stress on the ICE data warehouse solution was the Microsoft goal to move from storing 60 days of log data to 90 days. The network security team had critically important needs:

  • Enhance scalability. The existing implementation of ICE was beginning to reach the limits of scalability, including ETL throughput to meet the demands of the larger data warehouse required to support 90 days of data.

  • Improve query performance. The complexity of queries created by security specialists at Microsoft, combined with the vastness of the data warehouse, meant that some queries took more than a day to run. With the database growing to more than 90 terabytes, the team needed to improve query performance.

  • Support business intelligence (BI). The network security team and other security groups within Microsoft have an increasing need to extract advanced BI from the ICE data warehouse to enable earlier detection of trends and threats, and thus enable proactive measures.

Microsoft found the solution that it needed by upgrading its ICE 4.0 infrastructure to SQL Server 2008 R2 Parallel Data Warehouse.

Solution

The network security team participated in the SQL Server 2008 R2 Parallel Data Warehouse Technology Adoption Program (TAP) because of the product’s use of massively parallel processing (MPP), and its ability to support multiple instances of SQL Server.

Microsoft had been well served by its ICE infrastructure deployed through Microsoft SQL Server 2008 Enterprise (64-bit), hosted on a server with four processors and 64 gigabytes (GB) of random access memory (RAM), but felt limited because of the symmetric multiprocessing (SMP) architecture of its data warehouse. The team also liked that SQL Server 2008 R2 Parallel Data Warehouse is an appliance solution that uses inexpensive, industry-standard hardware. The ICE data warehouse runs on the Windows Server® 2008 R2 Enterprise operating system. As shown in Figure 1, ETL performance is essential because about 1 terabyte of log data is captured each day and loaded into the data warehouse.

Figure 1. Performing ETL on log feeds

Figure 1. Performing ETL on log feeds

Support for Multiple SQL Server Instances

In traditional SMP architecture, query processing occurs entirely within one physical instance of a database. CPU, memory, and storage impose physical limits upon speed and scale. In contrast, the MPP architecture of SQL Server 2008 R2 Parallel Data Warehouse partitions large tables across multiple physical nodes. Each node has dedicated CPU, memory, and storage, and each runs its own instance of SQL Server, in a parallel shared-nothing architecture. SQL Server 2008 R2 Parallel Data Warehouse delivers improved performance through its unique design as follows:

  • All components are balanced against each other to reduce performance bottlenecks, and all server and storage components are mirrored for enterprise-class redundancy.

  • A control node routes queries from applications to all compute nodes, and then collects and returns the result.

  • Because data is evenly distributed across multiple nodes and processing occurs in parallel, queries can be many times faster than on single SMP database servers.

The ICE 4.0 data warehouse is deployed as a two-rack system with 16 compute nodes. Each node has 64 GB of RAM, and each hosts its own instance of SQL Server 2008 R2 Parallel Data Warehouse. Data is stored on a storage area network (SAN), deployed through redundant Fibre Channel network fabrics.

Deployment and ongoing maintenance are simplified because SQL Server 2008 R2 Parallel Data Warehouse is delivered as an appliance with carefully balanced, preassembled hardware and software from leading hardware vendors.

Scalability to Support ICE

Although ICE 4.0 is an internal tool that Microsoft developed, it provides functionality similar to solutions of interest to other enterprises. It also serves as a model that displays the scalability of Parallel Data Warehouse. ICE captures, filters, stores, and supports queries of security events that occur across the Microsoft global corporate network.

The current implementation captures the following types of information:

  • E-mail events from servers running Microsoft Exchange Server

  • Logon events from Windows Server domain controllers

  • Web browsing and firewall events from servers running Microsoft Forefront™ Threat Management Gateway

  • Event properties such as locations, servers, and times

The ability of Parallel Data Warehouse to distribute single tables across multiple compute nodes and SQL Server instances enhanced scalability and query performance at Microsoft.

Benefits

SQL Server 2008 R2 Parallel Data Warehouse offers a range of benefits, including:

  • Scalability. As the network security team prepared to move from storing 60 days of log data to 90 days, its TAP program testing found that the Parallel Data Warehouse solution—Using SQL Server 2008 Integration Services—could easily scale from handling ETL loads of 1 terabyte a day to handling ETL loads of up to 4 terabytes a day. The scalability is important to the team because with the earlier SMP solution, processing the daily 1-terabyte loads in a timely manner was sometimes difficult.

  • Faster query performance. After upgrading to Parallel Data Warehouse, the network security team found that complex queries ran an average of 20 times faster than on the earlier implementation, and in some cases ran 64 times faster.

  • High concurrency. Parallel Data Warehouse uses a landing-zone architecture, which speeds ETL processing and contributes to faster query processing. A Parallel Data Warehouse appliance possesses an externally accessible storage host called the landing zone. Whereas a traditional, single-server SMP SQL instance might experience resource contention during scenarios of heavy data loads, the landing-zone feature of the Parallel Data Warehouse architecture enables query performance to continue without impact.

  • Business intelligence. The network security team plans to pull BI from the ICE information, taking advantage of the same MPP architecture that enables Parallel Data Warehouse to support faster query performance to provide the computational power required to perform BI analysis. The team sees promise in using BI to identify potential threats, traffic patterns, and other information that was not practical to attempt with the earlier SMP solution.

  • Seamless deployment. Deployment of Parallel Data Warehouse proved to be easy for the network security team. The team credits this ease to the fact that Parallel Data Warehouse appears to applications as simply a single instance of SQL Server. The team was delighted that developers did not need to know how many SQL Server instances were deployed as part of the ICE Parallel Data Warehouse solution.

Lessons Learned and Best Practices

  • ETL package optimization. As part of its upgrade to Parallel Data Warehouse, the network security team took the opportunity to examine its existing SQL Server Integration Services packages to ensure that they were as efficient as possible.

  • ODBC updates. The network security team updated its Open Database Connectivity (ODBC) drivers to those for Parallel Data Warehouse, and it updated the ODBC references in its ETL packages.

  • Shadow deployment. The network security team deployed the Parallel Data Warehouse version of ICE 4.0 on new hardware, while the older version continued to run. This enabled the team to test and tune the new platform before retiring the older version.

Conclusion

SQL Server 2008 R2 Parallel Data Warehouse represents a significant advancement because of its ability to partition large tables across multiple nodes,. Parallel Data Warehouse gives the Microsoft network security team the scalability required to support ETL processing far beyond its current needs (up to 4 terabytes a day) and the computational power to reduce query times by several orders of magnitude. Any organization that is dealing with multi-terabyte relational data stores could benefit from the MPP architecture of SQL Server 2008 R2 Parallel Data Warehouse.

For More Information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:

http://www.microsoft.com

http://www.microsoft.com/technet/itshowcase

© 2010 Microsoft Corporation. All rights reserved.

This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, SQL Server, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Was this page helpful?
(1500 characters remaining)
Thank you for your feedback
Show:
© 2014 Microsoft