4 out of 4 rated this helpful - Rate this topic

How Microsoft IT Uses SQL Server 2005 to Power a Global Forensic Data Security Tool

Technical White Paper

Published: November 2, 2007

The Information Security Consolidated Event Management System (ICE 3.0)

Download

Download Technical White Paper, 808 KB, Microsoft Word file

Download PowerPoint Presentation, 2.70 MB, Microsoft PowerPoint file

Download IT Pro Webcast, WMA, MP3

Download TechNet Radio

Situation

Solution

Benefits

Products & Technologies

Microsoft constructed a global forensic security system using the features of Microsoft SQL Server 2005. This paper discusses the business case, project, and resulting product that Microsoft partners and customers can use as proof of concept when planning large data management implementations.

ICE 3.0, the next generation of the custom Microsoft forensic event management system, is specifically designed to collect, filter, and house significant events for research and evidentiary use by forensic security engineers.

ICE 3.0 features the following benefits:

  • Provides significant performance gains compared to the previous system.
  • Increases data capacity by a factor of four compared to the previous system.
  • Provides a re-architected approach to ensure data availability for ad hoc and preconfigured queries.
  • Leverages the features of Microsoft SQL Server 2005.
  • Microsoft SQL Server 2005
  • Microsoft Windows Server 2003
  • Very Large Databases (VLDBs)
  • Storage Area Networks (SANs)

Executive Summary

The purpose of this white paper is to share architecture, design, and deployment considerations of and discuss the knowledge gained during the Microsoft implementation of SQL Server 2005 to power a global forensic security tool. Additionally, this paper demonstrates the value of current Microsoft products for capturing, filtering, organizing, storing, and analyzing network event data from a widely dispersed, complex corporate environment. This paper introduces the technologies employed, discusses the business case for the project, and describes the development and deployment effort to implement the solution.

Many of the principles and concepts described in this paper can be employed to develop systems that capture and warehouse very large volumes of data within any organization. Similarly, the design considerations for data capture and storage infrastructures can be applied to most enterprise-scale IT environments through Microsoft products. However, this paper is based on Microsoft Information Technology's (IT) experience and recommendations as an innovator. It is not intended to serve as a procedural guide. Each enterprise environment has unique circumstances; therefore, each organization should adapt the plans and lessons described in this paper to meet its specific needs.

Note: For security reasons, the sample names of servers, tables, internal resources, organizations, and internally developed security systems that are used in this paper do not represent real resource names that are used within Microsoft and are for illustration only.

Introduction

Audience

This paper assumes that readers are technical decision makers who are already familiar with Microsoft SQL Server 2005, including SQL Server Integration Services (SSIS). This paper also assumes a basic understanding of storage technologies such as Very Large Database Systems (VLDBs) and Storage Area Networks (SANs). This paper also discusses the concept of global forensic security systems.

Note: A TechNet Webcast on this topic was delivered on June 19, 2007. This Webcast is available for download from Microsoft TechNet. You can also download a Podcast of this event at http://www.microsoft.com/downloads

Technology Overview

Very Large Database (VLDB)

A VLDB is a relational database system that contains an extraordinarily large set of data. The term VLDB can also be used to describe a database that handles a large number of complex transactions per second. Implementing and maintaining VLDBs presents unique hardware, software, and network challenges, including the need for very large supporting storage systems to house vast quantities of data.

Storage Area Network (SAN)

A SAN is a dedicated network architecture designed to provide storage services. SANs make large amounts of storage available to servers, appearing as a local device to the server. Typically, SANs are used to support transactional data management systems that require high-speed access to the data housed on the Logical Unit Numbers (LUNs).

Note: The LUN is the identifier that is used to address a disk or disk partition within a SAN.

Forensic Event Systems

Forensic event systems monitor network traffic and selectively record specific events to provide post-incident tracking and research tools for investigation and evidence collection. Because each corporation's network, business, and legal outlook is unique, systems must be customized to capture information pertinent to that business and to make that data available to forensic security engineers for evaluation.

SQL Server 2005 Integration Services (SSIS)

SSIS is a toolset included with Microsoft SQL Server 2005 that provides a platform to develop custom applications for data management. With SSIS, developers can build packages to update data warehouses, to interact with external processes, to clean and mine data, to process analytic objects, and to perform administrative tasks.

Note: For more information about SSIS, visit the "Introduction to SQL Server 2005 Integration Services" Web page at http://www.microsoft.com/technet/prodtechnol/sql/2005/intro2is.mspx.

 

 

Introduction to the Information Security Consolidated Event Management System

ICE: An Overview

The Information Security Consolidated Event Management System (ICE) is an internal global forensic security tool developed by Microsoft for the Information Security (InfoSec) team to capture, filter, store, and query security events that occur across the Microsoft global corporate network.

ICE collects incoming and outgoing events on the Microsoft global corporate network 24 hours per day, seven days per week. The current implementation captures the following types of information:

  • E-mail events from Microsoft Exchange servers
  • Logon events from Microsoft Windows Server domain controllers
  • Web browsing and firewall events from Microsoft ISA Servers (Proxy servers)
  • Event properties such as locations, servers, and times

The number of events and their frequency require that a system be able to manage very large amounts of data and to use sophisticated hardware and software solutions such as SANs and VLDBs.

More than 300 million events are captured and stored daily in a single database that can be queried and analyzed by forensic security engineers when they investigate events for forensic evidence.

ICE User Profile

The ICE system is used by the following two distinct sets of consumers:

  • Security team: The Security team queries the database for forensic information to investigate activity by user aliases, IP addresses, specific URLs, and other data points. The Security team is made up of the following three teams:
  • Tier 1: Tier 1 scans new data each day to identify any suspicious activity.
  • Tier 2: Tier 2 investigates cases opened by Tier 1. Tier 2 tries to determine the source of the activity and whether there is sufficient evidence to visit the machine directly.
  • Tier 3: Tier 3 becomes involved in Tier 1 or Tier 2 cases if the data suggests that an intentional breach of policy or malicious activity has occurred. The information collected by Tier 3 is used to assist the legal teams.
  • Production Support team: The Production Support team maintains the application that feeds data into the database, manages new servers, and handles data processing issues.

ICE: Business Case

The previous Microsoft forensic event system, ICE 2.0, was based on Microsoft SQL Server 2000 and could not scale to manage the growing amounts of data experienced on the Microsoft global corporate network. ICE 2.0 did not provide features desired by data consumers and database administrators and posed the following challenges:

  • Data delivery from the desired servers across the global corporate network was inconsistent. ICE 2.0 used a pull mechanism to retrieve event data from all desired servers across the global corporate network.
  • SQL Server 2000 did not provide the required system of error notification. This necessitated frequent manual intervention if errors occurred in the data when loading or processing that data.
  • ICE 2.0 was not designed to use indexes for query optimization; therefore, queries could take six to seven hours to process. Carelessly formatted wildcard queries could cause the ICE 2.0 system to stop responding entirely.
  • The system did not effectively handle concurrent actions. Simultaneous requests to load and query data frequently blocked the whole system.
  • The amount of data that could be captured daily from all the global servers was limited to approximately 120 gigabytes (GB).
  • Data could only be retained for 12 to 15 days.
  • The maximum storage capacity of ICE 2.0 was approximately 2.5 terabytes.
  • These challenges hindered the effectiveness of data capture and impaired the ability of forensic security engineers to access and use the data.

Required Features

After analyzing the system and identifying its challenges, the Microsoft InfoSec team determined that a new version of ICE was required to manage and monitor events on the growing Microsoft global corporate network. To address this business need, the following required features were identified for the new version of ICE:

  • Data integrity: Provide a reliable mechanism for retrieving and loading information into the database supported by a comprehensive error-reporting system.
  • Data availability: Maintain all database records for at least 60 days.
  • Scalability: Fulfill the current forensic event management needs, and grow to support future needs.
  • Performance: Run queries 50% faster than the existing ICE 2.0 System does.
  • Wildcard searches: Quickly compose and run ad hoc queries designed by forensic security engineers during investigations without affecting other aspects of the system.

The ICE 3.0 Project

Project Objectives

The goal of ICE 3.0 was to address the challenges faced by the existing ICE 2.0 system as described earlier in this paper. The focus of the project was reliability, scalability, and improved schema design to facilitate faster access to the data.

An additional opportunity was identified during the project-planning phase. The initiation of the ICE 3.0 project coincided with the release of SQL Server 2005. Therefore, from its inception, ICE 3.0 was designed to exploit the new capabilities of SQL Server 2005, focusing on the following features:

  • SQL Server Integration Services (SSIS), which could increase reliability and speed of the data feed during both processing and loading.
  • The newly introduced partition tables, which could increase database efficiency and reduce query times by more than 50%.
  • SQL Server Reporting Services (SSRS), which could provide standardized reports for the forensic security engineers to reduce ad hoc querying.

With these requirements, ICE 3.0 provided the perfect proof of concept to demonstrate VLDB capabilities of SQL Server 2005.

Project Approach

Goals

The goals for the ICE 3.0 project were organized into three categories by using the easy-to-remember acronym "SQL" (storage, query, and load).

Table 1. Project Goals

Goal

Specifics

Storage-related

  • Design the data load capacity to support over 150 GB of data per day (later increased to 550 GB); a four-fold increase over ICE 2.0.
  • Retain data for a minimum of 60 days, a four-fold increase over ICE 2.0; an outside goal of 90 days was also considered.
  • Increase database scalability to approximately 30 terabytes.
  • Include Coordinated Universal Time (Greenwich Mean Time, UTC) in all log files.
  • Use UTC for all information stored in the database.

Query-related

  • Re-architect database schema on SQL Server 2005 for better performance for queries and analysis.
  • Provide standardized reports for forensic security engineers to reduce ad hoc querying.
  • Support as many as 25 concurrent users.
  • Support the Unicode character set for all processing.

Load-related

  • Redesign the feed system, creating a reliable method for gathering the logs that contain all the global events each day.
  • Provide a robust notification system for any errors that occurred in the data, in event attributes, or in processing.
  • Fortify the system to support capturing event traffic from more than 93 global Microsoft ISA Proxy servers, more than 15 Exchange e-mail servers, and global employee network logon events.

Out-of-Scope Elements

The ICE 3.0 project focused on the process of obtaining and using data within an immediate and continuous timeframe. Storing and retrieving historical data was not within the scope of the system nor the ICE 3.0 project. Additionally, backup and data recovery of the database were not included as part of this project.

Geographic Scope

Like the ICE 2.0 system, ICE 3.0 would collect events from across the Microsoft global corporate network.

Timelines

The ICE 3.0 project took less than one year to complete; it was initiated in September 2005 and went live in July 2006.

The project was divided into the following four phases.

Table 2. The Four Project Phases

Phase

Duration

Details

Envision and design

1.5 months

Completed business case; defined goals, resources, and times

Development

5.5 months

Completed by using the Software Engineering Institute's (SEI) Team Software Process and Personal Software Process (TSP-PSP)*

System testing

1 month

Completed by using SEI's TSP-PSP*

Stabilization and deployment

2 months

Included user acceptance testing

*For more information about SEI Software Processes, visit www.sei.cmu.edu/tsp/.

Team

The ICE 3.0 project was a collaborative venture designed and completed among multiple teams at Microsoft, including the following:

  • Cross IT (XIT) group
  • SQL Server Products group
  • Information Security (InfoSec) group
  • Network Security (NetSec) group
  • IT Utility Services group
  • Application Consulting and Engineering (ACE) group

Hardware and Sizing

To determine hardware requirements, estimated data points were used during planning, based on information collected from other internal projects. Additionally, an internal Microsoft team, Rightsizing, which specializes in determining and recommending server platforms that meet business requirements for internal implementations, assisted in determining the optimal hardware configuration for this project. Forecasts for load volumes and throughput were used to determine appropriate hardware recommendations.

Based on these forecasts, the required hardware was identified and purchased early in the project, allowing the IT team to use the hardware for staging and testing during development. The hardware identified successfully carried ICE 3.0 through development and into production.

Note: Among other tools, the Rightsizing team uses Performance Monitor (perfmon) to determine the location of bottlenecks on a server and then design an appropriate hardware solution.

The following table describes the hardware that is currently in production for ICE 3.0.

Table 3. ICE 3.0 Hardware Details

Role

Hardware Details

ICE SQL Server

Windows Server 2003 x64

4x2.2GHz CPU

32 GB RAM

 

ICE Application Server

Windows Server 2003 x64

4x2.2GHz CPU

8 GB RAM

 

Constraints

As with any infrastructure project, the main constraints were related to the budget and timeline. When selecting the hardware, budget constraints necessitated the use of quad-processor, single way systems. Time challenges included creating and testing an untried solution to handle an undetermined number of volumes without production-like development, test, or UAT environments. The team had to design the system, extrapolate the test observations, and then take a leap of faith, hoping that their approach would bring success.

Design and Development Approach

The approach to developing ICE 3.0 needed to be holistic, integrating all the following elements:

  • Application servers, hardware
  • Storage servers, hardware, SAN
  • Network infrastructure and interfaces

The Development team used proof of concepts, architecture approach, and design approach when addressing key goals such as utilizing multiple instances of SSIS and removing the query-blocking issues. Other options considered included Bulk Copy Program (BCP), bulk inserts, and SQL Server full-text searches.

Technologies

As mentioned earlier in this paper, ICE 3.0 was designed to exploit the new capabilities of SQL Server 2005, focusing on the following:

  • SQL Server Integration Services (SSIS), which could increase reliability and speed of the data feed during both processing and loading.
  • The newly introduced partition tables, which could increase database efficiency and reduce query times by more than 50%.
  • SQL Server Reporting Services (SSRS), which could provide standardized reports for the forensic security engineers to reduce ad hoc querying.

ICE 3.0 Application Architecture

ICE 3.0 was designed using the N-Tier approach, as illustrated in the following diagram.

Bb961995.image001(en-us,TechNet.10).jpg

Figure 1. Basic ICE 3.0 Architecture

Process Flow

Bb961995.image002(en-us,TechNet.10).jpg

Figure 2. ICE 3.0 Process Flow View 1

Table 4. ICE 3.0 Process Flow

Step

Process Step

Description

1

Proxy, e-mail, and logon servers log data

E-mail, logon, and Web browsing events are logged by servers on the Microsoft global corporate network into their flat files.

2

Collect data

Flat files are stored in file shares on their respective servers.

3

Extract data

By using a push process, flat files from each server's file share are extracted and then copied over the Microsoft global corporate network to a centralized file share on the ICE Server.

4

Preprocess and transform data

The loading agent detects the new files, filters the events for desired data (reconfigurable as determined by the forensic security team), and then transforms the data into formats required by the database.*

5

Load data

Filtered event and attribute data is loaded into partition tables in the ICE database.

6

Index

Tables are indexed to improve query response time.

7

Query

Data is housed in the ICE database and can be queried or included in standardized reports used by the Microsoft InfoSec team.

*After being processed, log files are moved to a separate folder and retained for a short period before being deleted.

 

Bb961995.image003(en-us,TechNet.10).jpg

Figure 3. ICE 3.0 Process Flow View 2

Load Balancer

The data entering the ICE 3.0 system comes in indiscrete masses based on the traffic patterns, work hours, geographic location, and loads experienced on the global corporate network. To handle the very large amounts of data quickly and effectively without data loss, ICE 3.0 depends primarily on a dedicated Load Balancer control program.

The Load Balancer handles the complete dataflow process and was developed to respond to changing data patterns. The Load Balancer is responsible for handling the very large amounts of data that enter the ICE system daily; it is engineered to easily handle 550 GB of data per day.

The Load Balancer is responsible for monitoring the central file share area for new flat files deposited by remote server groups. When new files are detected, the Load Balancer transforms the files into events and attributes, which are then loaded into the ICE 3.0 database.

Server Groups

For manageability, the upstream Proxy servers push their log files into the ICE 3.0 system. Then, the servers are grouped based on region and/or data volumes. For example, two Proxy servers from Redmond could each constitute a server group; a second server group could contain European Proxy servers; and a third group for the Proxy servers in Asia. Additional server groups contain the remaining types of servers, e-mail, and domain controllers. For this implementation of ICE 3.0, five server groups were used.

Note: The data returned in log files is not geographically uniform; for example, the amount of data that is collected from Redmond servers is far larger than other locations. Servers are first grouped to balance the data volume, and then they are grouped by geography.

Successive multi-day data was collected from 93 Microsoft ISA Proxy servers; using SSIS foreach enumerators provided an efficient way to loop through multiple flat files and folders to process the data from different servers.

SSIS: Process and Transform Data

ICE 3.0 is not a multi-threaded application but is a multi-instance application. Multi-instancing, supported by SSIS in SQL Server 2005, enables ICE to seamlessly respond to changes in incoming data volume by dynamically starting multiple instances of SSIS packages.

ICE runs one instance of SSIS for each server group; each instance can respond to as much as 6 GB of incoming data per hour from that server group. If the amount of data that is received from that group exceeds 6 GB, a new instance of SSIS is automatically started to respond to the overflow of data. As many as five additional instances for the two log types, to a maximum of twenty instances, can be started. The Load Balancer acts as the control application for all instances of SSIS.

The Load Balancer has been tested to load 40-60 GB of data per hour into the database.

Note: Specific throughput details are available in Appendix B, "Processing Values from Production ICE 3.0 System."

SSIS supports the use of a configurable, conditional split operation for filtering incoming events. This simple, code-free solution reduced the amount of unwanted data entering the database by 30%, retaining only the data considered necessary for forensic data analysis. This solution was easy to implement by using SSIS in SQL Server 2005.


After data filtering is complete, the data is transformed into a format appropriate for loading to the database.

SSIS also includes robust error and exception event mechanisms, which are especially important with very large data files that support generic and specific exceptions such as errors in flat file loading. All error and information messages are logged to a central error information table, which facilitates troubleshooting.

Data Workflow

Microsoft has engineered the ICE 3.0 database to store vast quantities of data; ICE 3.0 handles approximately 300 million Web and firewall Proxy events daily―over 18 billion events over 60 days. Careful design of the data workflow allows ICE to support loading, transforming, and querying this large and complex database. SQL Server 2005 allows ICE 3.0 to use indexed tables in conjunction with partition tables to manage the data and eliminate the need for indexed views.

Tables

Stage Tables

Each instance of SSIS loads data into a scratch area, or staging table, in the database, from which indexes are created. The number of indexes and the columns constituting the indexes are kept in a file that is read to define the indexes that must be created on the tables.

Bb961995.image004(en-us,TechNet.10).jpg

Figure 4. Data Workflow Overview

More specifically, the following process is used to organize data:

1.     Stage tables are created for each source group, for each instance of SSIS, and for each run date and hour.

2.     Stage tables are switched into partition tables, which are created for each source group, for each instance of SSIS, and for each run date.

3.     Views are created for each source group and for each run date.

4.     Consolidated views are created for all source groups for a single run date.

Note: Stage tables eliminated the blocking issues experienced in ICE 2.0, enabling data to be loaded and queried at the same time.


Partition Tables

ICE 3.0 exploits SQL Server 2005's support of partition tables; this allows tables to be segmented and smaller indexes to be created. Partition tables eliminate the need to re-index the entire table when new data is added. When a system is working with millions of events, partition tables make data management far more efficient.

Note: For more information about partition tables, refer to the MSDN article, "Partitioned Tables and Indexes in SQL Server 2005," which can be found at http://msdn2.microsoft.com/en-us/library/ms345146.aspx.

Data Organization

The events collected by ICE 3.0 occur across all time zones included in the Microsoft global corporate network. With time as an attribute of every event collected, time was the common value used to create the partition table structure.

For Proxy data, 24 partition tables were created, each representing a UTC hour to load and organize the global data collected in a single day. Using time to organize the partition table structure also enabled easy management of upper and lower limits. From the partition tables, various views were generated and consolidated. Views exposed the data and helped make the underlying database structure transparent to the forensic security engineer, facilitating and optimizing querying.

The number of proxy events far exceeded the number of e-mail or netlogon events; therefore, different approaches were taken for data organization.

Proxy Data

Bb961995.image005(en-us,TechNet.10).jpg

Figure 5. Sample Table Structure for Proxy Data

The following process outlines how the tables and views are created for Proxy server event data:

1.     Each instance of SSIS loads data into stage tables created for each source group.

2.     Stage tables are switched into partition tables. Twenty-four partition tables have been created for each server source group, instance, and date to enable events from the same day to be loaded into the same table.

3.     Views are created for each server source group, instance, and date.

4.     A consolidated view is provided above these base partition tables. Other views, such as daily and weekday views, are created to enable more efficient query construction when composing queries based on timeframes.


5.     Merged views are created, consolidating daily views for each run date.

6.     Grouped views are created for calendar weeks. Note that while grouped views are being created, new queries are blocked until view creation is complete.

Note: The ratio for both stage and partition tables: 1 group:1 table:1 instance of SSIS. If the threshold of 6 GB was exceeded and an additional instance of SSIS was started, a second table would be used to store that group's and instance's data.

Netlogon and E-Mail Data

The number of netlogon and e-mail events is far fewer than the number of Web Proxy and Firewall Proxy events. Only a single table is required for netlogon events and a daily table for e-mail events. Netlogon and e-mail tables are indexed for faster query response. The following diagram illustrates this simple process for e-mail logs.

Bb961995.image006(en-us,TechNet.10).jpg

Figure 6. E-mail Data Organization

Storage Infrastructure

The Microsoft IT Utility Services group designed, built, and supports the SAN infrastructure component of ICE 3.0.

The SAN, deployed by using redundant fiber channel (FC) network fabrics, includes a total usable data storage capacity of 40 terabytes, distributed across two dedicated storage controllers with 80 terabytes of total raw capacity. This configuration provides end-to-end redundant physical connections, and each server has four or more paths to each storage controller.


Each Logical Unit Number (LUN) spans 12 or 16 physical disks and is configured as a single 200 GB NTFS partition, which is configured as a basic disk.

Note: As per Microsoft best practices, the SQL Server 2005 server is configured to use small, 200 GB LUNs instead of fewer, larger volumes.

Bb961995.image007(en-us,TechNet.10).jpg

Figure 7. SAN Architecture for ICE 3.0

All data, log, and tempdb volumes are stored on RAID1+0 (disk mirroring and striping), and the remaining file shares are housed on RAID5 (disk striping with distributed parity). The storage capacity is allocated as follows.

Table 5. Storage Allocation and Utlization

ICE Component

RAID Configuration

Allocation (Post- RAID)

Current Usage

Storage of proxy logs

RAID5

6.5 terabytes

5 terabytes

Database

RAID1+0

40 terabytes

27 terabytes*

*Because of the variable volumes of data and the scheduled flushing of data from the system, the size of the database fluctuates based on event activity.

Testing: Performance and Tuning

All elements of the system required evaluation, both independently and as part of the whole system. These elements included:

  • Hardware storage
  • Application
  • Operating system platform
  • Database platform
  • Security and privacy

Generating the volumes and diversity of data projected to be as much as 1.2 terabytes per day in a test lab was impossible. Therefore, ICE 3.0 testing was completed by using manufactured baseline numbers and data points. Test scripts were used to generate the approximated data volume.

Sample data was collected and duplicated by using automated tools to generate the required volume of the most realistic data possible for testing. The system was tested to load as much as 60 GB per hour, and three terabytes was loaded into the database. Because of implementation timeline constraints, the database could not be tested with more data.

The following table lists the internal tools used to test volume and performance.

Table 6. Internal Tools Used for Testing

Tool Name

Description

SQLCoverageAnalyzer

Used SQL Server Profiler to collect run-time information about code coverage using test inputs for SQL Stored Procedures, allowing code-paths to be analyzed.

DATAGen

Quickly generated volumes of data in SQL Server databases. It also randomly selected data from other database tables or files.

EatMemory

Consumed all available system memory to facilitate testing applications under low-memory conditions.

CPUStress

Allowed specification of CPU utilization for stress testing to determine application stability.

Query Performance

Tool developed for the ICE 3.0 project that monitors query performance for five trials in five different time zones.

Data Generator

Tool developed for the ICE 3.0 project that dynamically generated log files for any run hour on any number of servers. This tool allowed for configuration of valid, invalid, error, and filtering records.

 

After ICE 3.0 was certified to handle these large data volumes efficiently, deployment was planned with the confidence the system would be successful.

Note: The Application Consulting and Engineering team also completed comprehensive security and privacy assessments before taking the application to production.

Performance

The following performance metrics were collected during a two-hour peak period on the SQL Server 2005 server by using Performance Monitor (perfmon).

Table 7. Performance Data

Metric

Observation

Workload profile

60% read, 40% write

Average throughput

52 MB/second, peak of 383 MB/second

Disk transfers (IOPS)

680 transfers/second (IOPS average), peak of 3211 transfers/second

Average response time
(disk IO)

Sub-12 milliseconds

 

IO performance was optimized by increasing the host bus controller parameter, HBA queuedepth, to 128 from the default of 32.

Note: The HBA queuedepth parameter acts as an IO throttle; setting the queuedepth too low can adversely affect performance. For more information, visit the following Microsoft Web site: http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx

During testing, it was determined that the SQL Server 2005 server required more memory, so the RAM was increased from 8 GB to 32 GB.

Splitting the SSIS and SQL Servers

Each instance of SSIS handles 6 GB of data per hour. During initial production, it was determined that providing the required throughput while supporting queries created a CPU bottleneck. Initial deployment housed the Load Balancer and the database on the same server.

During implementation, the volume of incoming traffic was much larger than expected. During peak traffic periods, as many as 10 instances of SSIS would be started, which required 1.5 GB of RAM. The production server had only 8 GB of RAM.

To address these issues, ICE 3.0 was split onto two servers, one housing the ICE 3.0 Application Server, and the second housing the SQL Server 2005 database. The SAN LUNs were also reconfigured. Additional performance improvements were achieved by increasing the amount of RAM on the database server to 32 GB. By separating the servers and adding RAM, without additional code changes, the system was able to process the backlog of data that was created because of the speed challenges during initial deployment and began to process new data. The split occurred over a weekend and did not disrupt ICE 3.0 operations.

After the addition of the second server, testing verified that throughput capacity had increased to handle 120 GB per hour on peak traffic days.

When testing network bandwidth and benchmarks, it was determined that a gigabit fat pipe configured between the ICE 3.0 Application Server and the SQL Server 2005 database did not provide better throughput than the Microsoft global corporate network.

Note: For hardware configuration details, refer to the "Hardware and Sizing" section earlier in this paper.

Data Retention

Data is retained in the database for 60 days. After this time, data is automatically flushed from the system. SQL Server 2005 table partitioning includes a feature to delete table information, performing daily table removal and cleanup operations. At the time of switch-in, only the hourly partitions are switched-in to the daily tables for efficient loading; during deletion, ICE switches out the entire table for the day.

Note: For more information about SQL Server switching, visit the following Microsoft Web site: http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/loading_bulk_data_partitioned_table.mspx

Backup and Recovery

ICE 3.0 focuses on the storage and analysis of current data. Backups of the database and transactions are not maintained.

For data integrity, RAID 1+0 has proven very reliable: not a single database incident has occurred in five years. When ICE 2.0 experienced a suspect incident, DBCC was used to repair the database, and the loss was limited to a few days' worth of data.

Alerts

Microsoft Operations Manager (MOM)-based alerts were configured to warn the support personnel of issues with the system. The following table lists example ICE 3.0 alerts.

Table 8. Alert Examples

Alert Category

Description

Logical disk

% Free <5%

ICEv3:SQLMonitor-uspBkpICEtables

Monitor success master table backup

Service and application management

Status of core Windows services, unexpected service terminations, service account, and authentication issues

Performance threshold monitoring

1. Physical Disk — Avg. Disk sec./

2. Physical Disk — Avg. Disk sec./Read

3. Memory — Pages/sec.

4. Processor — % Processor

5. Processor — % DPC

6. Processor — % Interrupt Time

7. Memory — % Committed bytes in use

8. Memory — Available Megabytes

Reporting and SSRS

Using the SSRS features of SQL Server 2005, the queries used most frequently for each data source were included in canned reports created for ICE 3.0. The reports were fine-tuned to provide optimal performance and parameterized to accept different input values.

Note: For more information about SSRS, visit the following Microsoft Web site: http://www.microsoft.com/technet/prodtechnol/sql/2005/2005ssrs.mspx

Deployment

Both ICE 2.0 and ICE 3.0 systems ran in parallel for two weeks to compare and validate the new implementation. After two weeks, the ICE 2.0 database was marked as read-only, and its data was retained in read-only mode for 60 days. Then, the data was purged as aged data.

No downtime was required during deployment because of the parallel operation of the old and new ICE implementations.

Note: Additional data flow pull processes will be added as new Proxy, e-mail, and domain servers come online.

Database Permissions

Only the ICE 3.0 Production team has write access to the database; this is accomplished by using security groups. The Production team is also the only group that has access to the complete database.

SQL Server Configuration Tuning

During testing and after the initial rollout, SQL Server 2005 configuration steps included the following:

  • The recovery model was set to Simple.
  • Separate disks were configured for data and log files.
  • Initial deployment created file groups for Web Proxy, Firewall Proxy, logon, and e-mail going into the primary file groups.
  • Disks were reconfigured so that the data and the index had separate file groups.
  • The database files were configured to start at 10 GB and grow to 200 GB.
  • 50 database files were created for Web Proxy data, and 50 database files were created for Firewall Proxy data.
  • Multiple data files were pre-allocated on multiple logical units (mount points) to increase load efficiency as recommended by the Rightsizing team.
  • Implemented Microsoft IT Utility Services data center standards.

Tools used for tuning and troubleshooting the database included the following:

  • Performance Monitor, included with Windows Server 2003:
  • Provided counters for RAM, I/O, Network, and CPU.
  • Run over a multi-week period to determine valleys, peaks, and average values for these resources.
  • Determined that 8 GB of RAM was insufficient for the ICE 3.0 database server, which was then upgraded to 32 GB.
  • Single core servers were used on the application and database servers. During testing, CPU utilization was greater than 70%. By fine-tuning the application and by stopping unrequired services and programs on the server, CPU utilization was brought below 70%.
  • Focused on maximizing the CPU and memory available to the applications.
  • Database Engine Tuning Advisor (formerly Index Tuning Wizard), included with SQL Server 2005:
  • Used to determine the right set of required indexes.
  • SQL Profiler:
  • Used to verify the query patterns.
  • Query Analyzer:
  • Used to optimize the query execution plans using the "Show Execution Plan" option to fine-tune queries.
  • SQLIO Disk Subsystem Benchmark Tool:
  • Used to determine SAN Server IO throughput.

Note: For more information about SQLIO, visit the following Microsoft Web site: http://www.microsoft.com/downloads/details.aspx?familyid=9a8b005b-84e4-4f24-8d65-cb53442d9e19&displaylang=en.

Training

Users of the ICE system were given training on the new system. Training included an introduction to the new database schema and an orientation to the tables and views available for queries. Users were also given instruction on how to use the specific regional views that grouped servers geographically to help users in a particular location quickly locate an event.

Users were instructed on the use of specific indexes and the use of the index column in their queries. Orientation to the standardized reports was also provided.

The production support teams were trained on how to support and maintain the system.

Project Successes

The ICE 3.0 system went live in July 2006 after 10.5 months of design, development, and testing.

By using the staging tables to separate loading from querying, the blocking issue experienced in ICE 2.0 with concurrent data loading and querying was eliminated in ICE 3.0.

Since deployment, ICE 3.0 has processed five times more data per day on average than ICE 2.0; ICE 3.0 currently processes an average of 550 GB per day. The amount of data processed increases to 1.2 terabytes on peak traffic days such as Mondays and Tuesdays. Using SQL Server 2005 has allowed the total potential database capacity to increase to 40 terabytes. Additionally, data retention increased four-fold, from 15 to 60 days.

ICE 3.0 has an improved notification system, using MOM alerts in SQL Server 2005 to identify data and processing errors.

The re-architected design, views, and standardized reports have reduced the number of wildcard queries used; when forensic security engineers run wildcard queries, those queries no longer block the whole system. Standardized reports were created to provide engineers with their most common queries, reducing the number ad hoc queries on the system.

Project Challenges

During development, the initial goal of handling 150 GB of data per day was determined to be insufficient. Upon review, several upstream Proxy servers had been omitted during initial planning. The data generated by these additional upstream servers increased the projected daily data load to 550 GB per day.

A delay occurred during user acceptance testing because the flat file logs were being delivered erratically to the central file share. In addition, stale events were being pushed into the system. A modification to the Load Balancer was made to make sure late-arriving data was processed and loaded into the database without creating re-indexing issues.

When planning ICE 3.0, estimated data volumes were used for unpredictable patterns of incoming traffic. Estimated data capacities for data transport, transformation, and storage were used when designing and testing the solution. Additionally, the number of upstream servers had to be estimated because exact numbers were impossible to determine; the Microsoft global corporate network is constantly growing and changing.

The ICE 3.0 design had to be sufficiently flexible and scalable to service the network as it developed. Development of the Load Balancer was especially challenging because it needed to handle undetermined data volumes and patterns; therefore, the developers estimated data volumes and patterns during planning.

Fine-tuning performance and identifying optimal configuration was also challenging; the size and complexity of the system made it difficult to locate and address bottlenecks.

Storing the data on the SAN, separate from the SQL Server 2005 server, added complexity during evaluation and tuning. This included testing and optimization of the network components and the server to SAN communication.

Lessons Learned

  • Keep the design open for multiple deployment options. The design of ICE 3.0 allowed the ICE Application Server, which housed SSIS, and the SQL Server 2005 database to be easily moved onto separate servers when the incoming data volume began to exceed 550 GB per day. This allowed ICE 3.0 to scale to the current volume of approximately 1.2 terabytes on peak traffic days.
  • When initially assembling the project team, engage complementary skill sets, including team members who have expertise in hardware sizing, storage sizing, and SQL Server.
  • When collaborating with geographically dispersed development teams, plan for time-zone complications.
  • Using existing Microsoft global corporate network infrastructure provided sufficient bandwidth to handle communication between the ICE Application Server and the SQL Server 2005 database. Testing showed performance of a dedicated connection was comparable to the performance of the corporate LAN connection.
  • Use SQL Management Studio and Showplan only when required; these components consume essential system resources, and potentially adverse affects would be magnified when there are thousands of database objects in the system.
  • Run the Index Fine Tuning Wizard on the production workload to determine the correct indexes. Store indexes in a separate table to accommodate for space-versus-speed decisions after deployment.
  • Create indexes in sequence when using multiple processor servers. Each index creation is parallelized across multiple processors.
  • Create distributed views on top of partitioned tables. These views make the system flexible enough to scale to very large volumes of approximately 27 terabytes.
  • Exploit the flexibility of SSIS. SSIS can be invoked externally via command line switches; this allows ICE to spawn multiple package instances based on the volume of incoming data.
  • Use Performance Monitor (perfmon) to test server performance elements such as disk throughput.
  • Enable cache memory on the SAN.

Future ICE Development

The success of ICE 3.0 has led to plans for additional investment to increase the size and scope of ICE.

Planning for ICE 4.0 has already begun; a new SAN system has been ordered with an increased 10 terabytes of storage space.

Design considerations for ICE 4.0 include the following:

  • Handle 20 additional types of events, such as DHCP server events and telephone logs.
  • House new event data in a consolidated ICE database to centralize forensic investigation, eliminating disparate sources.
  • Manage an increase in input volume of 4 to 5 terabytes per day.
  • Reduce the total number of tables and views; for example, have one table for on time data, another table for laggard data, and a view that includes both.
  • Examine the approach to data population, perhaps building the logic into the Load Balancer.
  • Use instances of SSIS to merge new logs into the staging table.
  • Provide more standardized reports.
  • Exploit features in the next generation of SQL Server.
  • Potentially dedicate 80% of SAN cache memory to write activity.

With an increase in the types of events being captured, the ICE database size will increase from approximately 27 terabytes to an estimated 40 terabytes, creating one of the largest SQL Server database implementations in the world.

Development of ICE 4.0 will begin in early FY08.

Conclusion

The Information Security Consolidated Event Management System 3.0 project achieved its goals of addressing the issues faced by the preceding SQL Server 2000-based system, ICE 2.0, in the areas of reliability, data capacity, and performance. ICE 3.0 provides an excellent example of the VLDB capabilities of SQL Server 2005, maintaining a single database instance of approximately 27 terabytes with the ability to scale further. By exploiting features introduced with SQL Server 2005, such as SSIS, ICE 3.0 handles significant volumes of data, reliably processing more than 550 GB per day and more than 9 billion events per month. Using stage and partition tables, ICE 3.0 provides a highly organized data management solution that supports concurrent data loading and reporting without sacrificing end-user performance.

 

For More Information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada information Centre at (800) 563-9048. Outside the United States and Canada, please contact your local Microsoft subsidiary. To access information through the World Wide Web, go to:

http://www.microsoft.com

http://www.microsoft.com/technet/itshowcase

This is a preliminary document and may be changed substantially prior to final commercial release of the software described herein.

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2007 Microsoft Corporation. All rights reserved.

Microsoft, Microsoft SQL Server 2005, Microsoft Windows Server 2003® and Microsoft SQL Server 2000™ are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

 All other trademarks are property of their respective owners.

 

Appendix A: Performance Comparison of SSIS and BCP

During the proof of concept stage of this project, various methods of loading data were tested: SELECT INTO, SSIS, and BCP. SSIS was determined to be faster than BCP, particularly when working with larger file sizes.

Table A1. BCP and SSIS Data Processing Times

File Size

BCP (Time Required)

SSIS (Time Required)

200 MB

26 seconds

10 seconds

2.17 GB

19 minutes

10 Minutes

4.34 GB

n/a

18 minutes

 

Appendix B: Processing Values from Production ICE 3.0 System

The following tables and graphs display processing data for the week of May 1-7, 2007.

Table B1. Data Processed (GB)

Date

On Time Firewall Proxy Data

On Time Web Proxy Data

Total On Time Data

Laggard Firewall Proxy Data

Laggard Web Proxy Data

Total Laggard Data

Total Data Processed (GB/Day)

1-May

341

300

641

190

158

348

990

2-May

343

326

669

180

141

321

990

3-May

400

320

720

190

147

337

1058

4-May

359

287

647

140

128

268

914

5-May

305

323

627

150

156

306

933

6-May

178

113

291

130

107

236

528

7-May

162

90

251

115

125

240

491

 

Bb961995.image008(en-us,TechNet.10).jpg

Figure B1. Data Processed (GB)

 

Table B2. Processing Time Taken (Hours)

Date

On Time Firewall Proxy Data

On Time Web Proxy Data

Laggard Firewall Proxy Data

Laggard Web Proxy Data

Maximum Time Required

1-May

18

14

23

23

23

2-May

18

14

23

19

23

3-May

19

14

23

18

23

4-May

18

15

24

20

24

5-May

18

15

22

20

22

6-May

15

7

26

17

26

7-May

12

7

21

18

21

 

Bb961995.image009(en-us,TechNet.10).jpg

Figure B2. Processing Time Taken (Hours)

 

 

 

Appendix C: Database Growth Over Time

The following graph illustrates how the data in the ICE 3.0 database grows over a 90-day period and illustrates why data is retained in ICE 3.0 for a finite period.

Bb961995.image010(en-us,TechNet.10).jpg

Figure C1. Database Growth (GB)

Appendix D: LUN Best Practices

LUNs

Basic Disks

Microsoft IT Utility Services supports clusters within their IT best practices; clusters on Windows Server 2003 are not supported on dynamic disks. Therefore, SAN configurations are housed on basic disks. The features that are provided by dynamic disks are also provided at the hardware level in the SAN hardware controller; therefore, dynamic disks are not used.

LUN Size and Performance

Microsoft IT Utility Services best practice is to use a larger number of smaller LUNs when creating a SAN. The reasons for this best practice include the following:

  • CHKDSK scans take longer to complete when they are run on volumes with a large number of files. By keeping the LUNs smaller and increasing their number, CHKDSK scans run more quickly.
  • Reduces the system effect of a single LUN failure.
  • Windows queues disk I/O on a per LUN basis. Therefore, having a larger number of smaller LUNs is a practical strategy for optimal performance.

For more information about how to plan for optimal I/O on SQL Server 2005, refer to the Microsoft TechNet white paper, "Predeployment I/O Best Practices, SQL Best Practices Article," at http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx.

For more information about how to support large logical disks on Windows, refer to the Windows Hardware Developer Central article, "Large Logical Unit Support and Windows Server 2003 SP1," at http://www.microsoft.com/whdc/device/storage/LUN_SP1.mspx.

Did you find this helpful?
(1500 characters remaining)
© 2013 Microsoft. All rights reserved.