How Microsoft IT Uses SQL Server 2005 to Power a Global Forensic Data Security Tool
How Microsoft IT Uses SQL Server 2005 to Power a Global Forensic Data Security Tool
Technical White Paper
Published: November 2, 2007
The Information Security Consolidated Event Management System (ICE 3.0)
|
Situation
|
Solution
|
Benefits
|
Products & Technologies
|
|
Microsoft constructed a global forensic security system using the features of Microsoft
SQL Server 2005. This paper discusses the business case, project, and resulting
product that Microsoft partners and customers can use as proof of concept when planning
large data management implementations.
|
ICE 3.0, the next generation of the custom Microsoft forensic event management system,
is specifically designed to collect, filter, and house significant events for research
and evidentiary use by forensic security engineers.
|
ICE 3.0 features the following benefits:
- Provides significant performance gains compared to the previous system.
- Increases data capacity by a factor of four compared to the previous system.
- Provides a re-architected approach to ensure data availability for ad hoc
and preconfigured queries.
- Leverages the features of Microsoft SQL Server 2005.
|
- Microsoft SQL Server 2005
- Microsoft Windows Server 2003
- Very Large Databases (VLDBs)
- Storage Area Networks (SANs)
|
Executive Summary
The purpose of this white paper is to share architecture, design,
and deployment considerations of and discuss the knowledge gained during the Microsoft
implementation of SQL Server 2005 to power a global forensic security tool. Additionally,
this paper demonstrates the value of current Microsoft products for capturing, filtering,
organizing, storing, and analyzing network event data from a widely dispersed, complex
corporate environment. This paper introduces the technologies employed, discusses
the business case for the project, and describes the development and deployment
effort to implement the solution.
Many of the principles and concepts described in this paper can be employed to develop
systems that capture and warehouse very large volumes of data within any organization.
Similarly, the design considerations for data capture and storage infrastructures
can be applied to most enterprise-scale IT environments through Microsoft products.
However, this paper is based on Microsoft Information Technology's (IT) experience
and recommendations as an innovator. It is not intended to serve as a procedural
guide. Each enterprise environment has unique circumstances; therefore, each organization
should adapt the plans and lessons described in this paper to meet its specific
needs.
Note: For security reasons, the sample names of servers, tables, internal
resources, organizations, and internally developed security systems that are used
in this paper do not represent real resource names that are used within Microsoft
and are for illustration only.
Introduction
Audience
This paper assumes that readers are technical decision makers who are already familiar
with Microsoft SQL Server 2005, including SQL Server Integration Services (SSIS).
This paper also assumes a basic understanding of storage technologies such as Very
Large Database Systems (VLDBs) and Storage Area Networks (SANs). This paper also
discusses the concept of global forensic security systems.
Note: A TechNet Webcast on this topic was delivered on June 19, 2007. This Webcast
is available for download from Microsoft TechNet. You can also download a Podcast
of this event at http://www.microsoft.com/downloads
Technology Overview
Very Large Database (VLDB)
A VLDB is a relational database system that contains an extraordinarily large set
of data. The term VLDB can also be used to describe a database that handles a large
number of complex transactions per second. Implementing and maintaining VLDBs presents
unique hardware, software, and network challenges, including the need for very large
supporting storage systems to house vast quantities of data.
Storage Area Network (SAN)
A SAN is a dedicated network architecture designed to provide storage services.
SANs make large amounts of storage available to servers, appearing as a local device
to the server. Typically, SANs are used to support transactional data management
systems that require high-speed access to the data housed on the Logical Unit Numbers
(LUNs).
Note: The LUN is the identifier that is used to address a disk or disk partition
within a SAN.
Forensic Event Systems
Forensic event systems monitor network traffic and selectively record specific events
to provide post-incident tracking and research tools for investigation and evidence
collection. Because each corporation's network, business, and legal outlook is unique,
systems must be customized to capture information pertinent to that business and
to make that data available to forensic security engineers for evaluation.
SQL Server 2005 Integration Services (SSIS)
SSIS is a toolset included with Microsoft SQL Server 2005 that provides a platform
to develop custom applications for data management. With SSIS, developers can build
packages to update data warehouses, to interact with external processes, to clean
and mine data, to process analytic objects, and to perform administrative tasks.
Introduction to the Information Security Consolidated Event Management System
ICE: An Overview
The Information Security Consolidated Event Management System (ICE) is an internal
global forensic security tool developed by Microsoft for the Information Security
(InfoSec) team to capture, filter, store, and query security events that occur across
the Microsoft global corporate network.
ICE collects incoming and outgoing events on the Microsoft global corporate network
24 hours per day, seven days per week. The current implementation captures the following
types of information:
- E-mail events from Microsoft Exchange servers
- Logon events from Microsoft Windows Server domain controllers
- Web browsing and firewall events from Microsoft ISA Servers (Proxy servers)
- Event properties such as locations, servers, and times
The number of events and their frequency require that a system be able to manage
very large amounts of data and to use sophisticated hardware and software solutions
such as SANs and VLDBs.
More than 300 million events are captured and stored daily in a single database
that can be queried and analyzed by forensic security engineers when they investigate
events for forensic evidence.
ICE User Profile
The ICE system is used by the following two distinct sets of consumers:
- Security team: The Security team queries the database for forensic information
to investigate activity by user aliases, IP addresses, specific URLs, and other
data points. The Security team is made up of the following three teams:
- Tier 1: Tier 1 scans new data each day to identify any suspicious activity.
- Tier 2: Tier 2 investigates cases opened by Tier 1. Tier 2 tries to determine the
source of the activity and whether there is sufficient evidence to visit the machine
directly.
- Tier 3: Tier 3 becomes involved in Tier 1 or Tier 2 cases if the data suggests that
an intentional breach of policy or malicious activity has occurred. The information
collected by Tier 3 is used to assist the legal teams.
- Production Support team: The Production Support team maintains the application
that feeds data into the database, manages new servers, and handles data processing
issues.
ICE: Business Case
The previous Microsoft forensic event system, ICE 2.0, was based on Microsoft SQL
Server 2000 and could not scale to manage the growing amounts of data experienced
on the Microsoft global corporate network. ICE 2.0 did not provide features desired
by data consumers and database administrators and posed the following challenges:
- Data delivery from the desired servers across the global corporate network was inconsistent.
ICE 2.0 used a pull mechanism to retrieve event data from all desired servers across
the global corporate network.
- SQL Server 2000 did not provide the required system of error notification. This
necessitated frequent manual intervention if errors occurred in the data when loading
or processing that data.
- ICE 2.0 was not designed to use indexes for query optimization; therefore, queries
could take six to seven hours to process. Carelessly formatted wildcard queries
could cause the ICE 2.0 system to stop responding entirely.
- The system did not effectively handle concurrent actions. Simultaneous requests
to load and query data frequently blocked the whole system.
- The amount of data that could be captured daily from all the global servers was
limited to approximately 120 gigabytes (GB).
- Data could only be retained for 12 to 15 days.
- The maximum storage capacity of ICE 2.0 was approximately 2.5 terabytes.
- These challenges hindered the effectiveness of data capture and impaired the ability
of forensic security engineers to access and use the data.
Required Features
After analyzing the system and identifying its challenges, the Microsoft InfoSec
team determined that a new version of ICE was required to manage and monitor events
on the growing Microsoft global corporate network. To address this business need,
the following required features were identified for the new version of ICE:
- Data integrity: Provide a reliable mechanism for retrieving and loading information
into the database supported by a comprehensive error-reporting system.
- Data availability: Maintain all database records for at least 60 days.
- Scalability: Fulfill the current forensic event management needs, and grow
to support future needs.
- Performance: Run queries 50% faster than the existing ICE 2.0 System does.
- Wildcard searches: Quickly compose and run ad hoc queries designed by forensic
security engineers during investigations without affecting other aspects of the
system.
The ICE 3.0 Project
Project Objectives
The goal of ICE 3.0 was to address the challenges faced by the existing ICE 2.0
system as described earlier in this paper. The focus of the project was reliability,
scalability, and improved schema design to facilitate faster access to the data.
An additional opportunity was identified during the project-planning phase. The
initiation of the ICE 3.0 project coincided with the release of SQL Server 2005.
Therefore, from its inception, ICE 3.0 was designed to exploit the new capabilities
of SQL Server 2005, focusing on the following features:
- SQL Server Integration Services (SSIS), which could increase reliability and speed
of the data feed during both processing and loading.
- The newly introduced partition tables, which could increase database efficiency
and reduce query times by more than 50%.
- SQL Server Reporting Services (SSRS), which could provide standardized reports for
the forensic security engineers to reduce ad hoc querying.
With these requirements, ICE 3.0 provided the perfect proof of concept to demonstrate
VLDB capabilities of SQL Server 2005.
Project Approach
Goals
The goals for the ICE 3.0 project were organized into three categories by using
the easy-to-remember acronym "SQL" (storage, query, and load).
Table 1. Project Goals
|
Goal
|
Specifics
|
|
Storage-related
|
- Design the data load capacity to support over 150 GB of data per day (later increased
to 550 GB); a four-fold increase over ICE 2.0.
- Retain data for a minimum of 60 days, a four-fold increase over ICE 2.0; an outside
goal of 90 days was also considered.
- Increase database scalability to approximately 30 terabytes.
- Include Coordinated Universal Time (Greenwich Mean Time, UTC) in all log files.
- Use UTC for all information stored in the database.
|
|
Query-related
|
- Re-architect database schema on SQL Server 2005 for better performance for queries
and analysis.
- Provide standardized reports for forensic security engineers to reduce ad hoc querying.
- Support as many as 25 concurrent users.
- Support the Unicode character set for all processing.
|
|
Load-related
|
- Redesign the feed system, creating a reliable method for gathering the logs that
contain all the global events each day.
- Provide a robust notification system for any errors that occurred in the data, in
event attributes, or in processing.
- Fortify the system to support capturing event traffic from more than 93 global Microsoft
ISA Proxy servers, more than 15 Exchange e-mail servers, and global employee network
logon events.
|
Out-of-Scope Elements
The ICE 3.0 project focused on the process of obtaining and using data within an
immediate and continuous timeframe. Storing and retrieving historical data was not
within the scope of the system nor the ICE 3.0 project. Additionally, backup and
data recovery of the database were not included as part of this project.
Geographic Scope
Like the ICE 2.0 system, ICE 3.0 would collect events from across the Microsoft
global corporate network.
Timelines
The ICE 3.0 project took less than one year to complete; it was initiated in September
2005 and went live in July 2006.
The project was divided into the following four phases.
Table 2. The Four Project Phases
|
Phase
|
Duration
|
Details
|
|
Envision and design
|
1.5 months
|
Completed business case; defined goals, resources, and times
|
|
Development
|
5.5 months
|
Completed by using the Software Engineering Institute's (SEI) Team Software Process
and Personal Software Process (TSP-PSP)*
|
|
System testing
|
1 month
|
Completed by using SEI's TSP-PSP*
|
|
Stabilization and deployment
|
2 months
|
Included user acceptance testing
|
*For more information about SEI Software Processes, visit www.sei.cmu.edu/tsp/.
Team
The ICE 3.0 project was a collaborative venture designed and completed among multiple
teams at Microsoft, including the following:
- Cross IT (XIT) group
- SQL Server Products group
- Information Security (InfoSec) group
- Network Security (NetSec) group
- IT Utility Services group
- Application Consulting and Engineering (ACE) group
Hardware and Sizing
To determine hardware requirements, estimated data points were used during planning,
based on information collected from other internal projects. Additionally, an internal
Microsoft team, Rightsizing, which specializes in determining and recommending
server platforms that meet business requirements for internal implementations, assisted
in determining the optimal hardware configuration for this project. Forecasts for
load volumes and throughput were used to determine appropriate hardware recommendations.
Based on these forecasts, the required hardware was identified and purchased early
in the project, allowing the IT team to use the hardware for staging and testing
during development. The hardware identified successfully carried ICE 3.0 through
development and into production.
Note: Among other tools, the Rightsizing team uses Performance Monitor (perfmon)
to determine the location of bottlenecks on a server and then design an appropriate
hardware solution.
The following table describes the hardware that is currently in production for ICE
3.0.
Table 3. ICE 3.0 Hardware Details
|
Role
|
Hardware Details
|
|
ICE SQL Server
|
Windows Server 2003 x64
4x2.2GHz CPU
32 GB RAM
|
|
ICE Application Server
|
Windows Server 2003 x64
4x2.2GHz CPU
8 GB RAM
|
Constraints
As with any infrastructure project, the main constraints were related to the budget
and timeline. When selecting the hardware, budget constraints necessitated the use
of quad-processor, single way systems. Time challenges included creating and testing
an untried solution to handle an undetermined number of volumes without production-like
development, test, or UAT environments. The team had to design the system, extrapolate
the test observations, and then take a leap of faith, hoping that their approach
would bring success.
Design and Development Approach
The approach to developing ICE 3.0 needed to be holistic, integrating all the following
elements:
- Application servers, hardware
- Storage servers, hardware, SAN
- Network infrastructure and interfaces
The Development team used proof of concepts, architecture approach, and design approach
when addressing key goals such as utilizing multiple instances of SSIS and removing
the query-blocking issues. Other options considered included Bulk Copy Program (BCP),
bulk inserts, and SQL Server full-text searches.
Technologies
As mentioned earlier in this paper, ICE 3.0 was designed to exploit the new capabilities
of SQL Server 2005, focusing on the following:
- SQL Server Integration Services (SSIS), which could increase reliability and speed
of the data feed during both processing and loading.
- The newly introduced partition tables, which could increase database efficiency
and reduce query times by more than 50%.
- SQL Server Reporting Services (SSRS), which could provide standardized reports for
the forensic security engineers to reduce ad hoc querying.
ICE 3.0 Application Architecture
ICE 3.0 was designed using the N-Tier approach, as illustrated in the following
diagram.
Figure 1. Basic ICE 3.0 Architecture
Process Flow
.jpg)
Figure 2. ICE 3.0 Process Flow View 1
Table 4. ICE 3.0 Process Flow
|
Step
|
Process Step
|
Description
|
|
1
|
Proxy, e-mail, and logon servers log data
|
E-mail, logon, and Web browsing events are logged by servers on the Microsoft global
corporate network into their flat files.
|
|
2
|
Collect data
|
Flat files are stored in file shares on their respective servers.
|
|
3
|
Extract data
|
By using a push process, flat files from each server's file share are extracted
and then copied over the Microsoft global corporate network to a centralized file
share on the ICE Server.
|
|
4
|
Preprocess and transform data
|
The loading agent detects the new files, filters the events for desired data (reconfigurable
as determined by the forensic security team), and then transforms the data into
formats required by the database.*
|
|
5
|
Load data
|
Filtered event and attribute data is loaded into partition tables in the ICE database.
|
|
6
|
Index
|
Tables are indexed to improve query response time.
|
|
7
|
Query
|
Data is housed in the ICE database and can be queried or included in standardized
reports used by the Microsoft InfoSec team.
|
*After being processed, log files are moved to a separate folder and retained for
a short period before being deleted.
.jpg)
Figure 3. ICE 3.0 Process Flow View 2
Load Balancer
The data entering the ICE 3.0 system comes in indiscrete masses based on the traffic
patterns, work hours, geographic location, and loads experienced on the global corporate
network. To handle the very large amounts of data quickly and effectively without
data loss, ICE 3.0 depends primarily on a dedicated Load Balancer control program.
The Load Balancer handles the complete dataflow process and was developed to respond
to changing data patterns. The Load Balancer is responsible for handling the very
large amounts of data that enter the ICE system daily; it is engineered to easily
handle 550 GB of data per day.
The Load Balancer is responsible for monitoring the central file share area for
new flat files deposited by remote server groups. When new files are detected, the
Load Balancer transforms the files into events and attributes, which are then loaded
into the ICE 3.0 database.
Server Groups
For manageability, the upstream Proxy servers push their log files into the ICE
3.0 system. Then, the servers are grouped based on region and/or data volumes. For
example, two Proxy servers from Redmond could each constitute a server group; a
second server group could contain European Proxy servers; and a third group for
the Proxy servers in Asia. Additional server groups contain the remaining types
of servers, e-mail, and domain controllers. For this implementation of ICE 3.0,
five server groups were used.
Note: The data returned in log files is not geographically uniform; for example,
the amount of data that is collected from Redmond servers is far larger than other
locations. Servers are first grouped to balance the data volume, and then they are
grouped by geography.
Successive multi-day data was collected from 93 Microsoft ISA Proxy servers; using
SSIS foreach enumerators provided an efficient way to loop through multiple
flat files and folders to process the data from different servers.
SSIS: Process and Transform Data
ICE 3.0 is not a multi-threaded application but is a multi-instance application.
Multi-instancing, supported by SSIS in SQL Server 2005, enables ICE to seamlessly
respond to changes in incoming data volume by dynamically starting multiple instances
of SSIS packages.
ICE runs one instance of SSIS for each server group; each instance can respond to
as much as 6 GB of incoming data per hour from that server group. If the amount
of data that is received from that group exceeds 6 GB, a new instance of SSIS is
automatically started to respond to the overflow of data. As many as five additional
instances for the two log types, to a maximum of twenty instances, can be started.
The Load Balancer acts as the control application for all instances of SSIS.
The Load Balancer has been tested to load 40-60 GB of data per hour into the database.
Note: Specific throughput details are available in Appendix B, "Processing
Values from Production ICE 3.0 System."
SSIS supports the use of a configurable, conditional split operation for filtering
incoming events. This simple, code-free solution reduced the amount of unwanted
data entering the database by 30%, retaining only the data considered necessary
for forensic data analysis. This solution was easy to implement by using SSIS in
SQL Server 2005.
After data filtering is complete, the data is transformed into a format appropriate
for loading to the database.
SSIS also includes robust error and exception event mechanisms, which are especially
important with very large data files that support generic and specific exceptions
such as errors in flat file loading. All error and information messages are logged
to a central error information table, which facilitates troubleshooting.
Data Workflow
Microsoft has engineered the ICE 3.0 database to store vast quantities of data;
ICE 3.0 handles approximately 300 million Web and firewall Proxy events daily―over
18 billion events over 60 days. Careful design of the data workflow allows ICE to
support loading, transforming, and querying this large and complex database. SQL
Server 2005 allows ICE 3.0 to use indexed tables in conjunction with partition tables
to manage the data and eliminate the need for indexed views.
Tables
Stage Tables
Each instance of SSIS loads data into a scratch area, or staging table, in the database,
from which indexes are created. The number of indexes and the columns constituting
the indexes are kept in a file that is read to define the indexes that must be created
on the tables.
.jpg)
Figure 4. Data Workflow Overview
More specifically, the following process is used to organize data:
1. Stage
tables are created for each source group, for each instance of SSIS, and for each
run date and hour.
2. Stage
tables are switched into partition tables, which are created for each source group,
for each instance of SSIS, and for each run date.
3. Views
are created for each source group and for each run date.
4. Consolidated
views are created for all source groups for a single run date.
Note: Stage tables eliminated the blocking issues experienced in ICE 2.0,
enabling data to be loaded and queried at the same time.
Partition Tables
ICE 3.0 exploits SQL Server 2005's support of partition tables; this allows tables
to be segmented and smaller indexes to be created. Partition tables eliminate the
need to re-index the entire table when new data is added. When a system is working
with millions of events, partition tables make data management far more efficient.
Data Organization
The events collected by ICE 3.0 occur across all time zones included in the Microsoft
global corporate network. With time as an attribute of every event collected, time
was the common value used to create the partition table structure.
For Proxy data, 24 partition tables were created, each representing a UTC hour to
load and organize the global data collected in a single day. Using time to organize
the partition table structure also enabled easy management of upper and lower limits.
From the partition tables, various views were generated and consolidated. Views
exposed the data and helped make the underlying database structure transparent to
the forensic security engineer, facilitating and optimizing querying.
The number of proxy events far exceeded the number of e-mail or netlogon events;
therefore, different approaches were taken for data organization.
Proxy Data
.jpg)
Figure 5. Sample Table Structure for Proxy Data
The following process outlines how the tables and views are created for Proxy server
event data:
1. Each
instance of SSIS loads data into stage tables created for each source group.
2. Stage
tables are switched into partition tables. Twenty-four partition tables have been
created for each server source group, instance, and date to enable events from the
same day to be loaded into the same table.
3. Views
are created for each server source group, instance, and date.
4. A
consolidated view is provided above these base partition tables. Other views, such
as daily and weekday views, are created to enable more efficient query construction
when composing queries based on timeframes.
5. Merged
views are created, consolidating daily views for each run date.
6. Grouped
views are created for calendar weeks. Note that while grouped views are being created,
new queries are blocked until view creation is complete.
Note: The ratio for both stage and partition tables: 1 group:1 table:1 instance
of SSIS. If the threshold of 6 GB was exceeded and an additional instance of SSIS
was started, a second table would be used to store that group's and instance's data.
Netlogon and E-Mail Data
The number of netlogon and e-mail events is far fewer than the number of Web Proxy
and Firewall Proxy events. Only a single table is required for netlogon events and
a daily table for e-mail events. Netlogon and e-mail tables are indexed for faster
query response. The following diagram illustrates this simple process for e-mail
logs.
.jpg)
Figure 6. E-mail Data Organization
Storage Infrastructure
The Microsoft IT Utility Services group designed, built, and supports the SAN infrastructure
component of ICE 3.0.
The SAN, deployed by using redundant fiber channel (FC) network fabrics, includes
a total usable data storage capacity of 40 terabytes, distributed across two dedicated
storage controllers with 80 terabytes of total raw capacity. This configuration
provides end-to-end redundant physical connections, and each server has four or
more paths to each storage controller.
Each Logical Unit Number (LUN) spans 12 or 16 physical disks and is configured as
a single 200 GB NTFS partition, which is configured as a basic disk.
Note: As per Microsoft best practices, the SQL Server 2005 server is configured
to use small, 200 GB LUNs instead of fewer, larger volumes.
.jpg)
Figure 7. SAN Architecture for ICE 3.0
All data, log, and tempdb volumes are stored on RAID1+0 (disk mirroring and striping),
and the remaining file shares are housed on RAID5 (disk striping with distributed
parity). The storage capacity is allocated as follows.
Table 5. Storage Allocation and Utlization
|
ICE Component
|
RAID Configuration
|
Allocation (Post- RAID)
|
Current Usage
|
|
Storage of proxy logs
|
RAID5
|
6.5 terabytes
|
5 terabytes
|
|
Database
|
RAID1+0
|
40 terabytes
|
27 terabytes*
|
*Because of the variable volumes of data and the scheduled flushing of data from
the system, the size of the database fluctuates based on event activity.
Testing: Performance and Tuning
All elements of the system required evaluation, both independently and as part of
the whole system. These elements included:
- Hardware storage
- Application
- Operating system platform
- Database platform
- Security and privacy
Generating the volumes and diversity of data projected to be as much as 1.2 terabytes
per day in a test lab was impossible. Therefore, ICE 3.0 testing was completed by
using manufactured baseline numbers and data points. Test scripts were used to generate
the approximated data volume.
Sample data was collected and duplicated by using automated tools to generate the
required volume of the most realistic data possible for testing. The system was
tested to load as much as 60 GB per hour, and three terabytes was loaded into the
database. Because of implementation timeline constraints, the database could not
be tested with more data.
The following table lists the internal tools used to test volume and performance.
Table 6. Internal Tools Used for Testing
|
Tool Name
|
Description
|
|
SQLCoverageAnalyzer
|
Used SQL Server Profiler to collect run-time information about code coverage using
test inputs for SQL Stored Procedures, allowing code-paths to be analyzed.
|
|
DATAGen
|
Quickly generated volumes of data in SQL Server databases. It also randomly selected
data from other database tables or files.
|
|
EatMemory
|
Consumed all available system memory to facilitate testing applications under low-memory
conditions.
|
|
CPUStress
|
Allowed specification of CPU utilization for stress testing to determine application
stability.
|
|
Query Performance
|
Tool developed for the ICE 3.0 project that monitors query performance for five
trials in five different time zones.
|
|
Data Generator
|
Tool developed for the ICE 3.0 project that dynamically generated log files for
any run hour on any number of servers. This tool allowed for configuration of valid,
invalid, error, and filtering records.
|
After ICE 3.0 was certified to handle these large data volumes efficiently, deployment
was planned with the confidence the system would be successful.
Note: The Application Consulting and Engineering team also completed comprehensive
security and privacy assessments before taking the application to production.
Performance
The following performance metrics were collected during a two-hour peak period on
the SQL Server 2005 server by using Performance Monitor (perfmon).
Table 7. Performance Data
|
Metric
|
Observation
|
|
Workload profile
|
60% read, 40% write
|
|
Average throughput
|
52 MB/second, peak of 383 MB/second
|
|
Disk transfers (IOPS)
|
680 transfers/second (IOPS average), peak of 3211 transfers/second
|
|
Average response time
(disk IO)
|
Sub-12 milliseconds
|
IO performance was optimized by increasing the host bus controller parameter, HBA
queuedepth, to 128 from the default of 32.
During testing, it was determined that the SQL Server 2005 server required more
memory, so the RAM was increased from 8 GB to 32 GB.
Splitting the SSIS and SQL Servers
Each instance of SSIS handles 6 GB of data per hour. During initial production,
it was determined that providing the required throughput while supporting queries
created a CPU bottleneck. Initial deployment housed the Load Balancer and the database
on the same server.
During implementation, the volume of incoming traffic was much larger than expected.
During peak traffic periods, as many as 10 instances of SSIS would be started, which
required 1.5 GB of RAM. The production server had only 8 GB of RAM.
To address these issues, ICE 3.0 was split onto two servers, one housing the ICE
3.0 Application Server, and the second housing the SQL Server 2005 database. The
SAN LUNs were also reconfigured. Additional performance improvements were achieved
by increasing the amount of RAM on the database server to 32 GB. By separating the
servers and adding RAM, without additional code changes, the system was able to
process the backlog of data that was created because of the speed challenges during
initial deployment and began to process new data. The split occurred over a weekend
and did not disrupt ICE 3.0 operations.
After the addition of the second server, testing verified that throughput capacity
had increased to handle 120 GB per hour on peak traffic days.
When testing network bandwidth and benchmarks, it was determined that a gigabit
fat pipe configured between the ICE 3.0 Application Server and the SQL Server 2005
database did not provide better throughput than the Microsoft global corporate network.
Note: For hardware configuration details, refer to the "Hardware and
Sizing" section earlier in this paper.
Data Retention
Data is retained in the database for 60 days. After this time, data is automatically
flushed from the system. SQL Server 2005 table partitioning includes a feature to
delete table information, performing daily table removal and cleanup operations.
At the time of switch-in, only the hourly partitions are switched-in to the daily
tables for efficient loading; during deletion, ICE switches out the entire table
for the day.
Backup and Recovery
ICE 3.0 focuses on the storage and analysis of current data. Backups of the database
and transactions are not maintained.
For data integrity, RAID 1+0 has proven very reliable: not a single database incident
has occurred in five years. When ICE 2.0 experienced a suspect incident, DBCC was
used to repair the database, and the loss was limited to a few days' worth of data.
Alerts
Microsoft Operations Manager (MOM)-based alerts were configured to warn the support
personnel of issues with the system. The following table lists example ICE 3.0 alerts.
Table 8. Alert Examples
|
Alert Category
|
Description
|
|
Logical disk
|
% Free <5%
|
|
ICEv3:SQLMonitor-uspBkpICEtables
|
Monitor success master table backup
|
|
Service and application management
|
Status of core Windows services, unexpected service terminations, service account,
and authentication issues
|
|
Performance threshold monitoring
|
1. Physical Disk — Avg. Disk sec./
2. Physical Disk — Avg. Disk sec./Read
3. Memory — Pages/sec.
4. Processor — % Processor
5. Processor — % DPC
6. Processor — % Interrupt Time
7. Memory — % Committed bytes in use
8. Memory — Available Megabytes
|
Reporting and SSRS
Using the SSRS features of SQL Server 2005, the queries used most frequently for
each data source were included in canned reports created for ICE 3.0. The reports
were fine-tuned to provide optimal performance and parameterized to accept different
input values.
Deployment
Both ICE 2.0 and ICE 3.0 systems ran in parallel for two weeks to compare and validate
the new implementation. After two weeks, the ICE 2.0 database was marked as read-only,
and its data was retained in read-only mode for 60 days. Then, the data was purged
as aged data.
No downtime was required during deployment because of the parallel operation of
the old and new ICE implementations.
Note: Additional data flow pull processes will be added as new Proxy, e-mail,
and domain servers come online.
Database Permissions
Only the ICE 3.0 Production team has write access to the database; this is accomplished
by using security groups. The Production team is also the only group that has access
to the complete database.
SQL Server Configuration Tuning
During testing and after the initial rollout, SQL Server 2005 configuration steps
included the following:
- The recovery model was set to Simple.
- Separate disks were configured for data and log files.
- Initial deployment created file groups for Web Proxy, Firewall Proxy, logon, and
e-mail going into the primary file groups.
- Disks were reconfigured so that the data and the index had separate file groups.
- The database files were configured to start at 10 GB and grow to 200 GB.
- 50 database files were created for Web Proxy data, and 50 database files were created
for Firewall Proxy data.
- Multiple data files were pre-allocated on multiple logical units (mount points)
to increase load efficiency as recommended by the Rightsizing team.
- Implemented Microsoft IT Utility Services data center standards.
Tools used for tuning and troubleshooting the database included the following:
- Performance Monitor, included with Windows Server 2003:
- Provided counters for RAM, I/O, Network, and CPU.
- Run over a multi-week period to determine valleys, peaks, and average values for
these resources.
- Determined that 8 GB of RAM was insufficient for the ICE 3.0 database server, which
was then upgraded to 32 GB.
- Single core servers were used on the application and database servers. During testing,
CPU utilization was greater than 70%. By fine-tuning the application and by stopping
unrequired services and programs on the server, CPU utilization was brought below
70%.
- Focused on maximizing the CPU and memory available to the applications.
- Database Engine Tuning Advisor (formerly Index Tuning Wizard), included with SQL
Server 2005:
- Used to determine the right set of required indexes.
- SQL Profiler:
- Used to verify the query patterns.
- Query Analyzer:
- Used to optimize the query execution plans using the "Show Execution Plan"
option to fine-tune queries.
- SQLIO Disk Subsystem Benchmark Tool:
- Used to determine SAN Server IO throughput.
Training
Users of the ICE system were given training on the new system. Training included
an introduction to the new database schema and an orientation to the tables and
views available for queries. Users were also given instruction on how to use the
specific regional views that grouped servers geographically to help users in a particular
location quickly locate an event.
Users were instructed on the use of specific indexes and the use of the index column
in their queries. Orientation to the standardized reports was also provided.
The production support teams were trained on how to support and maintain the system.
Project Successes
The ICE 3.0 system went live in July 2006 after 10.5 months of design, development,
and testing.
By using the staging tables to separate loading from querying, the blocking issue
experienced in ICE 2.0 with concurrent data loading and querying was eliminated
in ICE 3.0.
Since deployment, ICE 3.0 has processed five times more data per day on average
than ICE 2.0; ICE 3.0 currently processes an average of 550 GB per day. The amount
of data processed increases to 1.2 terabytes on peak traffic days such as Mondays
and Tuesdays. Using SQL Server 2005 has allowed the total potential database capacity
to increase to 40 terabytes. Additionally, data retention increased four-fold, from
15 to 60 days.
ICE 3.0 has an improved notification system, using MOM alerts in SQL Server 2005
to identify data and processing errors.
The re-architected design, views, and standardized reports have reduced the number
of wildcard queries used; when forensic security engineers run wildcard queries,
those queries no longer block the whole system. Standardized reports were created
to provide engineers with their most common queries, reducing the number ad hoc
queries on the system.
Project Challenges
During development, the initial goal of handling 150 GB of data per day was determined
to be insufficient. Upon review, several upstream Proxy servers had been omitted
during initial planning. The data generated by these additional upstream servers
increased the projected daily data load to 550 GB per day.
A delay occurred during user acceptance testing because the flat file logs were
being delivered erratically to the central file share. In addition, stale events
were being pushed into the system. A modification to the Load Balancer was made
to make sure late-arriving data was processed and loaded into the database without
creating re-indexing issues.
When planning ICE 3.0, estimated data volumes were used for unpredictable patterns
of incoming traffic. Estimated data capacities for data transport, transformation,
and storage were used when designing and testing the solution. Additionally, the
number of upstream servers had to be estimated because exact numbers were impossible
to determine; the Microsoft global corporate network is constantly growing and changing.
The ICE 3.0 design had to be sufficiently flexible and scalable to service the network
as it developed. Development of the Load Balancer was especially challenging because
it needed to handle undetermined data volumes and patterns; therefore, the developers
estimated data volumes and patterns during planning.
Fine-tuning performance and identifying optimal configuration was also challenging;
the size and complexity of the system made it difficult to locate and address bottlenecks.
Storing the data on the SAN, separate from the SQL Server 2005 server, added complexity
during evaluation and tuning. This included testing and optimization of the network
components and the server to SAN communication.
Lessons Learned
- Keep the design open for multiple deployment options. The design of ICE 3.0 allowed
the ICE Application Server, which housed SSIS, and the SQL Server 2005 database
to be easily moved onto separate servers when the incoming data volume began to
exceed 550 GB per day. This allowed ICE 3.0 to scale to the current volume of approximately
1.2 terabytes on peak traffic days.
- When initially assembling the project team, engage complementary skill sets, including
team members who have expertise in hardware sizing, storage sizing, and SQL Server.
- When collaborating with geographically dispersed development teams, plan for time-zone
complications.
- Using existing Microsoft global corporate network infrastructure provided sufficient
bandwidth to handle communication between the ICE Application Server and the SQL
Server 2005 database. Testing showed performance of a dedicated connection was comparable
to the performance of the corporate LAN connection.
- Use SQL Management Studio and Showplan only when required; these components consume
essential system resources, and potentially adverse affects would be magnified when
there are thousands of database objects in the system.
- Run the Index Fine Tuning Wizard on the production workload to determine the correct
indexes. Store indexes in a separate table to accommodate for space-versus-speed
decisions after deployment.
- Create indexes in sequence when using multiple processor servers. Each index creation
is parallelized across multiple processors.
- Create distributed views on top of partitioned tables. These views make the system
flexible enough to scale to very large volumes of approximately 27 terabytes.
- Exploit the flexibility of SSIS. SSIS can be invoked externally via command line
switches; this allows ICE to spawn multiple package instances based on the volume
of incoming data.
- Use Performance Monitor (perfmon) to test server performance elements such as disk
throughput.
- Enable cache memory on the SAN.
Future ICE Development
The success of ICE 3.0 has led to plans for additional investment to increase the
size and scope of ICE.
Planning for ICE 4.0 has already begun; a new SAN system has been ordered with an
increased 10 terabytes of storage space.
Design considerations for ICE 4.0 include the following:
- Handle 20 additional types of events, such as DHCP server events and telephone logs.
- House new event data in a consolidated ICE database to centralize forensic investigation,
eliminating disparate sources.
- Manage an increase in input volume of 4 to 5 terabytes per day.
- Reduce the total number of tables and views; for example, have one table for on
time data, another table for laggard data, and a view that includes both.
- Examine the approach to data population, perhaps building the logic into the Load
Balancer.
- Use instances of SSIS to merge new logs into the staging table.
- Provide more standardized reports.
- Exploit features in the next generation of SQL Server.
- Potentially dedicate 80% of SAN cache memory to write activity.
With an increase in the types of events being captured, the ICE database size will
increase from approximately 27 terabytes to an estimated 40 terabytes, creating
one of the largest SQL Server database implementations in the world.
Development of ICE 4.0 will begin in early FY08.
Conclusion
The Information Security Consolidated Event Management System 3.0 project achieved
its goals of addressing the issues faced by the preceding SQL Server 2000-based
system, ICE 2.0, in the areas of reliability, data capacity, and performance. ICE
3.0 provides an excellent example of the VLDB capabilities of SQL Server 2005, maintaining
a single database instance of approximately 27 terabytes with the ability to scale
further. By exploiting features introduced with SQL Server 2005, such as SSIS, ICE
3.0 handles significant volumes of data, reliably processing more than 550 GB per
day and more than 9 billion events per month. Using stage and partition tables,
ICE 3.0 provides a highly organized data management solution that supports concurrent
data loading and reporting without sacrificing end-user performance.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales
Information Center at (800) 426-9400. In Canada, call the Microsoft Canada information
Centre at (800) 563-9048. Outside the United States and Canada, please contact your
local Microsoft subsidiary. To access information through the World Wide Web, go
to:
http://www.microsoft.com
http://www.microsoft.com/technet/itshowcase
This is a preliminary document and may be changed substantially prior to final commercial
release of the software described herein.
The information contained in this document represents the current view of Microsoft
Corporation on the issues discussed as of the date of publication. Because Microsoft
must respond to changing market conditions, it should not be interpreted to be a
commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy
of any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES,
EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user.
Without limiting the rights under copyright, no part of this document may be reproduced,
stored in or introduced into a retrieval system, or transmitted in any form or by
any means (electronic, mechanical, photocopying, recording, or otherwise), or for
any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other
intellectual property rights covering subject matter in this document. Except as
expressly provided in any written license agreement from Microsoft, the furnishing
of this document does not give you any license to these patents, trademarks, copyrights,
or other intellectual property.
© 2007 Microsoft Corporation. All rights reserved.
Microsoft, Microsoft SQL Server 2005™,
Microsoft Windows Server 2003® and Microsoft SQL
Server 2000™ are either registered trademarks or trademarks of Microsoft
Corporation in the United States and/or other countries.
All other trademarks are property of their respective owners.
Appendix A: Performance Comparison of SSIS and BCP
During the proof of concept stage of this project, various methods of loading data
were tested: SELECT INTO, SSIS, and BCP. SSIS was determined to be faster than BCP,
particularly when working with larger file sizes.
Table A1. BCP and SSIS Data Processing Times
|
File Size
|
BCP (Time Required)
|
SSIS (Time Required)
|
|
200 MB
|
26 seconds
|
10 seconds
|
|
2.17 GB
|
19 minutes
|
10 Minutes
|
|
4.34 GB
|
n/a
|
18 minutes
|
Appendix B: Processing Values from Production ICE 3.0 System
The following tables and graphs display processing data for the week of May 1-7,
2007.
Table B1. Data Processed (GB)
|
Date
|
On Time Firewall Proxy Data
|
On Time Web Proxy Data
|
Total On Time Data
|
Laggard Firewall Proxy Data
|
Laggard Web Proxy Data
|
Total Laggard Data
|
Total Data Processed (GB/Day)
|
|
1-May
|
341
|
300
|
641
|
190
|
158
|
348
|
990
|
|
2-May
|
343
|
326
|
669
|
180
|
141
|
321
|
990
|
|
3-May
|
400
|
320
|
720
|
190
|
147
|
337
|
1058
|
|
4-May
|
359
|
287
|
647
|
140
|
128
|
268
|
914
|
|
5-May
|
305
|
323
|
627
|
150
|
156
|
306
|
933
|
|
6-May
|
178
|
113
|
291
|
130
|
107
|
236
|
528
|
|
7-May
|
162
|
90
|
251
|
115
|
125
|
240
|
491
|
.jpg)
Figure B1. Data Processed (GB)
Table B2. Processing Time Taken (Hours)
|
Date
|
On Time Firewall Proxy Data
|
On Time Web Proxy Data
|
Laggard Firewall Proxy Data
|
Laggard Web Proxy Data
|
Maximum Time Required
|
|
1-May
|
18
|
14
|
23
|
23
|
23
|
|
2-May
|
18
|
14
|
23
|
19
|
23
|
|
3-May
|
19
|
14
|
23
|
18
|
23
|
|
4-May
|
18
|
15
|
24
|
20
|
24
|
|
5-May
|
18
|
15
|
22
|
20
|
22
|
|
6-May
|
15
|
7
|
26
|
17
|
26
|
|
7-May
|
12
|
7
|
21
|
18
|
21
|
.jpg)
Figure B2. Processing Time Taken (Hours)
Appendix C: Database Growth Over Time
The following graph illustrates how the data in the ICE 3.0 database grows over
a 90-day period and illustrates why data is retained in ICE 3.0 for a finite period.
.jpg)
Figure C1. Database Growth (GB)
Appendix D: LUN Best Practices
LUNs
Basic Disks
Microsoft IT Utility Services supports clusters within their IT best practices;
clusters on Windows Server 2003 are not supported on dynamic disks. Therefore, SAN
configurations are housed on basic disks. The features that are provided by dynamic
disks are also provided at the hardware level in the SAN hardware controller; therefore,
dynamic disks are not used.
LUN Size and Performance
Microsoft IT Utility Services best practice is to use a larger number of smaller
LUNs when creating a SAN. The reasons for this best practice include the following:
- CHKDSK scans take longer to complete when they are run on volumes with a large number
of files. By keeping the LUNs smaller and increasing their number, CHKDSK scans
run more quickly.
- Reduces the system effect of a single LUN failure.
- Windows queues disk I/O on a per LUN basis. Therefore, having a larger number of
smaller LUNs is a practical strategy for optimal performance.
For more information about how to plan for optimal I/O on SQL Server 2005, refer
to the Microsoft TechNet white paper, "Predeployment I/O Best Practices, SQL
Best Practices Article," at
http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx.
For more information about how to support large logical disks on Windows, refer
to the Windows Hardware Developer Central article, "Large Logical Unit Support
and Windows Server 2003 SP1," at
http://www.microsoft.com/whdc/device/storage/LUN_SP1.mspx.