Physical Database Storage Design

Published: June 26, 2006 | Updated: February 12, 2007

SQL Server Technical Article

Writers: Kathy Lu, Lewis Bruck

Technical Reviewer: Robert Dorr, Paul Randal, Conor Cunningham, Wei Xiao, Don Vilen, Kangrong Yan, Peter Byrne


Applies To: SQL Server 2005 SP1

Summary: This article provides a guide for physical storage design and gives recommendations and trade-offs for physical hardware design and file architecture.

On This Page

Introduction
Executive Summary
Database File Storage Architecture
Physical Server System Design
Different Types of Workloads and Their I/O Requirements
Conclusion
Appendix A: The System Components
Appendix B: SQL Server 2005 Disk Usage

Introduction

This physical storage design guide is written to help database architects and administrators configure Microsoft SQL Server 2005 systems for optimal I/O performance and space utilization. This article also emphasizes the new SQL Server 2005 features that are significant to these discussions.

This paper is separated into three sections. The first section focuses on the database file storage design with the main focus on the new SQL Server 2005 features. The second section describes the design considerations of the physical hardware of the server system. This includes topics about system components such as disks, interfaces, buses, and RAID levels, and how each of these hardware components measure up with respect to the design criteria. The third section reviews the different types of workloads and the I/O requirements for various sizes of applications.

Executive Summary

Physical Database Design Steps

Designing a physical database requires you to consider multiple aspects and to make sure the design decisions integrate well with one another. The following are the recommended steps, in order, to take when you approach this task.

  1. Characterize the I/O workload of the application. For more information, see the “Characterizing Application Workloads” section later in this paper.

  2. Determine reliability and performance requirements for the database system.

  3. Determine hardware required to support the decisions made in steps 1 and 2.

  4. Configure SQL Server 2005 to take advantage of the hardware in step 3.

  5. Track performance as workload changes.

Recommendations

The following is the list of recommendations that are made throughout this paper. Detailed information is provided in later sections of this paper.

  • Always use Page Checksum to audit data integrity.

  • Consider using compression for read-only filegroups for higher storage efficiency.

  • Use NTFS for security and availability of many SQL Server 2005 features.

  • Use instant file initialization for performance optimization.

  • Use manual file growth database options.

  • Use partitioning (available in Enterprise Edition) for better database manageability.

  • Storage-align indexes with their respective base tables for easier and faster maintenance.

  • Storage-align commonly joined tables for faster joins and better maintenance.

  • Choose your RAID level carefully. For more information, see Table 1A in Appendix A later in this paper. For excellent performance and high reliability of both read and write data patterns, use RAID10. For read-only data patterns, use RAID5.

  • For optimized I/O parallelism, use 64 KB or 256 KB stripe size.

  • For future scalability and ease of maintenance, use volume mount points.

  • To increase bus bandwidth reliability, use multipathing software.

  • For small servers with less than three disks performing mostly sequential I/O, or servers with approximately eight disks performing random I/O, PCI is sufficient. However, PCI-X is recommended and can service a wider range of servers with varying workload size.

  • Directly attached I/O is recommended for small- to medium-sized servers.

  • SAN systems are recommended for larger servers.

  • NAS systems are not recommended. Use iSCSI instead.

  • For better recoverability, use a SCSI interface instead of SATA and IDE.

  • For larger server loads, use SCSI or SATA with TCQ support.

  • Store transaction logs separately from data files. Do not stripe the log on the same disks as the data files.

  • For large bandwidth demands on the I/O bus, use a different bus for the transaction log files.

  • The number of data files within a single filegroup should equal the number of CPU cores.

Database File Storage Architecture

Before you decide on the hardware configurations, such as whether to use RAID5 or RAID10, how many spindles to use, or what interface to use, you should determine how the database file architecture will be configured.

This section describes the new SQL Server 2005 features that are intended to improve many aspects of database storage design.

New SQL Server 2005 Features

In SQL Server 2005, there are several new features that are geared towards improving the performance, reliability, availability, capacity, and manageability of database files. There are also a few features that SQL Server 2005 DBAs must know about that could affect physical design choices.

The new features in the spotlight are the following:

  • Page checksum and page-level restore combination

  • Read-only filegroups on compressed drives

  • Instant file initialization

  • Database snapshots

  • Row-level versioning

  • Partitioning

Page Checksum and Page-Level Restore

With the introduction of the page checksum feature, SQL Server 2005 can help increase data protection. The page checksum feature can detect disk I/O errors that are not reported by the operating system or the underlying hardware. Page checksums are created and verified for both the data and the log blocks when PAGE_VERIFY_CHECKSUM is enabled for the database. Page checksum is the preferred alternative to the previous TORN_PAGE_DETECTION option and is on by default. This option can be changed by using ALTER DATABASE ... SET PAGE_VERIFY.
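As a brief illustration of the option described above, the following T-SQL checks the current setting and enables checksum-based page verification; the database name AdventureWorks is only a placeholder.

-- Check the current page verification setting for a database.
SELECT name, page_verify_option_desc
FROM sys.databases
WHERE name = 'AdventureWorks';

-- Enable checksum-based page verification.
ALTER DATABASE AdventureWorks SET PAGE_VERIFY CHECKSUM;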

A checksum of the page is calculated, verified, or both at these times:

  • Checksum is calculated when a page is written to disk from the buffer pool.

  • Checksum is calculated and verified when a page is read from disk into the buffer pool, provided that the page was previously written to disk with a checksum and the PAGE_VERIFY option is enabled.

  • Checksum is calculated and verified during BACKUP and RESTORE commands.

    • During a BACKUP DATABASE operation, page checksums are verified and a backup checksum is created on the backup media when the backup CHECKSUM option is specified. The default behavior is NO_CHECKSUM.

    • By default, during RESTORE DATABASE and RESTORE VERIFYONLY operations, checksums are verified when the backup checksum is present on the backup media. If the overall checksum of the backup stream is not present in the backup media, no verification is performed.

      • By default, if the CHECKSUM option is specified and the backup media contains the checksum, checksum verification is performed. If the CHECKSUM option is specified and no checksum is present on the backup media, then the restore operation fails and returns an error message that indicates the absence of the checksum.

      • If the NO_CHECKSUM option is specified, the checksum validation is disabled.

The checksum is stored in the page header that already exists, so no extra storage is required. Although there is a slight performance cost in the calculation of the checksum, we strongly recommend using page checksum verification at all times to help preserve data integrity.
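The backup-time verification described earlier can be requested explicitly with the CHECKSUM option. The following is a minimal sketch; the database name and file path are placeholders.

-- Verify page checksums and write a backup checksum to the media.
BACKUP DATABASE AdventureWorks
TO DISK = 'E:\Backups\AdventureWorks.bak'
WITH CHECKSUM;

-- Re-verify the checksums on the backup media without restoring.
RESTORE VERIFYONLY
FROM DISK = 'E:\Backups\AdventureWorks.bak'
WITH CHECKSUM;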

When a page checksum error is detected, the server returns an 824 error:

SQL Server detected a logical consistency-based I/O error: <error type>. It occurred during a <operation> of page <pageid> in database ID <dbid> at offset <file offset> in file <fileid>. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

The DBA can restore the affected pages by using online page-level restore. This new feature is available in SQL Server 2005 Enterprise Edition. Online page-level restore gives the server better availability in two ways:

  • Restoring less data. This reduces the time required to copy and recover.

  • Restoring only a small part of the database. This leaves the majority of the database online for continued use.
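A minimal sketch of an online page-level restore follows, assuming the 824 error identified page 1:57 and that full and log backups are available; the database name, file paths, and page ID are placeholders, and in practice subsequent log backups plus a tail-log backup are applied to bring the page current.

-- Restore only the damaged page from the most recent full backup.
RESTORE DATABASE AdventureWorks
PAGE = '1:57'
FROM DISK = 'E:\Backups\AdventureWorks.bak'
WITH NORECOVERY;

-- Apply later log backups, ending with the tail-log backup, to complete recovery.
RESTORE LOG AdventureWorks
FROM DISK = 'E:\Backups\AdventureWorks_TailLog.trn'
WITH RECOVERY;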

For more information about page checksum and page-level restore, see “Detecting and Coping with Media Errors” in SQL Server Books Online.

Data Compression for Read-Only Filegroups

SQL Server 2005 offers support for read-only filegroups on compressed drives. This helps to improve capacity utilization, manageability, and I/O performance. Any secondary filegroup of a user database that is marked as read-only can be placed on compressed NTFS volumes. If the entire database is set to read-only, all filegroups can be placed on compressed volumes.

Compressed read-only filegroups prevent updates while saving disk space by compressing these files. From a manageability standpoint, when the filegroup is set to read-only and compressed, there is almost no maintenance required for this filegroup. You should still perform some DBCC CHECK activities on these volumes occasionally to detect possible media deterioration. Because the files in the filegroups are compressed and there is less data to read from the disk, I/O performance frequently improves. However, CPU performance might decrease because of the processing time that is required.
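A possible sequence is sketched below, assuming a secondary filegroup named HistoryFG whose files reside on (or will be moved to) a compressed NTFS volume; all names are placeholders, and marking a filegroup read-only may require exclusive access to the database.

-- Mark the secondary filegroup read-only; its files can then live on a compressed
-- NTFS volume (for example, a folder compressed with the Windows compact utility).
ALTER DATABASE AdventureWorks MODIFY FILEGROUP HistoryFG READ_ONLY;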

An internal performance test for compressed vs. uncompressed read-only file shows this I/O vs. CPU performance tradeoff:

  • The test was performed on a quad Xeon 700MHz processor machine that was running Windows Server 2003.

  • The test used a dynamically striped volume of five physical disks at 10,000 RPM.

  • The test performed a scan by using a simple SELECT COUNT (*) statement on a table of approximately 90,000 rows of data in a 3.5 GB read-only file.

  • The compression ratio was 8.0 to 1.

The observed counters and results of this test are shown in Table 1.

Table 1:  Test Results for Scan on Compressed and Uncompressed Read-Only Filegroup

Uncompressed file (file size 3.5 GB)

  • Average CPU processing time: spikes at 42% at the beginning; remains below 16% for the remainder of the time.

  • Average physical disk bytes/sec: maintains 14 MB/sec uniformly for all 5 physical disks.

  • Average elapsed time: 58 seconds.

Compressed file (file size 0.44 GB)

  • Average CPU processing time: spikes at 40% at the beginning; remains below 20% for the remainder of the time.

  • Average physical disk bytes/sec: maintains 750 KB/sec uniformly for all 5 disks.

  • Average elapsed time: 105 seconds.

Note that the I/O vs. CPU performance tradeoff scenario might be different for different workloads and environments.

This feature is particularly useful in situations where large portions of the database contain historical information, are used only for analysis or forecasting, and there is limited disk space.

For more information about how to convert a filegroup to read-only and compressed, see “Read-Only Filegroups” in SQL Server Books Online.

For scalable shared database systems, data compression on read-only volumes is a practical option. For more information, see “Overview of Scalable Shared Databases” in SQL Server Books Online.

Instant File Initialization

SQL Server 2005 added support for the Windows instant file initialization feature. By default, when a file is created or grows larger, the pages in that file are initialized by writing zeros before the file is used. This overwrites any existing data that remains on the disk. Instant initialization skips this zeroing; it is used only for data files (not log files) and is enabled when the account running SQL Server has the Windows SE_MANAGE_VOLUME_NAME privilege, which is available only on Microsoft Windows XP, Windows Server 2003, or later versions. Zero initialization normally occurs in the following scenarios:

  • During file creation

    • CREATE DATABASE, including tempdb creation at server startup.

    • RESTORE DATABASE

  • During file modification

    • ALTER DATABASE...MODIFY FILE.

    • Modifications that result in autogrow activity.

The zeroing process can affect performance, especially during modifications that trigger an autogrow. The instant file initialization feature skips this zeroing of data pages, which reduces the time required to create or grow very large databases, including tempdb.
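One rough way to observe whether instant file initialization is in effect is to time a sizable file-size modification; with zeroing skipped, the elapsed time should be near zero. This is only a sketch with placeholder database and file names, and the new SIZE must be larger than the file’s current size.

DECLARE @start datetime;
SET @start = GETDATE();

-- Grow the data file (placeholder database and logical file names).
ALTER DATABASE AdventureWorks
MODIFY FILE (NAME = AdventureWorks_Data, SIZE = 5GB);

-- A near-zero elapsed time suggests that zero initialization was skipped.
SELECT DATEDIFF(ms, @start, GETDATE()) AS elapsed_ms;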

With instant file initialization, the previous data on newly allocated space is never zeroed; it is overwritten only when the data page is actually allocated and written. Avoiding the zeroing applies to any application that runs with the Windows SE_MANAGE_VOLUME_NAME privilege and calls the SetFileValidData Windows API.

Note: If SQL Server is running under the Network Service account, this privilege is off by default.

For more information, see “Database File Initialization” in SQL Server Books Online.

There are conditions where the old data can be located by an application designed to use the extended privilege. Any application taking advantage of the SE_MANAGE_VOLUME_NAME privilege can uncover old data on disk without requiring SQL Server.

SQL Server provides strong ACL security, but it is always good practice to ensure that these files have restrictive ACLs. The types of files and their respective ACLs are displayed in Table 2.

Table 2:  Potentially Exposed-Unwritten-Data Scenarios and the Protecting DACLs

Database files (.mdf, .ndf) when SQL Server is shut down

  • The ACLs remain with the Administrators group and the service account that is running SQL Server.

Detached database files

  • If the database auto_close option is turned off, the ACLs are set to the Windows account that performed the database detach.

  • If the auto_close option is on, the ACL does not change and remains with the service account and the local Administrator account.

  • Note that only sysadmin accounts can detach a database.

Backup database files

  • The backup database file ACLs are inherited from the ACLs of the parent folder that contains those files. This is set by the sysadmin, db_owner, or db_backupoperator that specifies the location of the backup file.

Note: SQL Server tracks database allocations and never exposes un-initialized portions of the data file to a user. It is also not possible to simply copy or access a SQL Server data file and obtain the un-initialized data. By default, the operating system exposes un-initialized data as all zeros during any read operation. To obtain the raw data, the process must have the proper security account, privileges, and access tokens.

We recommend using instant file initialization for performance reasons. In-house testing for creating and growing files shows a significant improvement in performance when instant file initialization is used. This test was performed on two different servers:  

  • The smaller server, named “Server A” in Table 3, was a single-processor computer that was running Windows Server 2003 and using direct-attached IDE to a Windows Basic volume. This volume consisted of one 120 GB physical disk running at 7200 RPM using the Intel 82801EB Ultra ATA storage controller.

  • The larger server, named “Server B” in Table 3, was a quad-core computer running Windows Server 2003 and using Ultra320 SCSI attached to a Windows Dynamic volume. This volume consisted of five striped physical disks of 36.4 GB each running at 15,000 RPM.

Because Server A is a smaller computer, it ran tests using 100 MB files; Server B ran tests with 1 GB files. When Server B ran with 100 MB increments, the performance benefit of instant file initialization was diminished because the disk cache on the server compensated for the time it takes to physically write the zeros. The performance results of this test are shown in Table 3.

Table 3:   Performance Comparison of Instant Initialization vs. Regular Initialization

Server     Scenario                           Instant Init (s)   Regular Init (s)   Difference (s)
Server A   Adding new 100 MB file                    0.2                2.8                2.6
Server A   Modify file size to grow 100 MB           0.1                2.2                2.1
Server B   Adding new 1 GB file                      0.1               15.2               15.1
Server B   Modify file size to grow 1 GB             0.1               18.5               18.4

Although instant file initialization provides faster file growth, it should not be used as a substitute for proper file planning. File autogrow allows for automatic file maintenance, but the default growth increment of 10% might not apply to all database scenarios. It is a good practice to fine-tune the autogrow size of files in accordance with the specific needs of that database. It is also a good practice to manually set the autogrow size for data files, and to manually grow files in preparation for database growth, to avoid adverse effects on performance. For more information about how to determine the size of autogrow and the size of files, see SQL Server 2000 Operations Guide: Capacity and Storage Management. This information is still relevant in SQL Server 2005 for file sizing and autogrow sizing.
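For example, the 10% default can be replaced with a fixed growth increment, and the file can be pre-grown ahead of expected load; the database and file names and the sizes below are placeholders, not recommendations.

-- Replace the default percentage autogrow with a fixed increment.
ALTER DATABASE AdventureWorks
MODIFY FILE (NAME = AdventureWorks_Data, FILEGROWTH = 512MB);

-- Manually pre-grow the file in preparation for expected database growth.
ALTER DATABASE AdventureWorks
MODIFY FILE (NAME = AdventureWorks_Data, SIZE = 50GB);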

Database Snapshots

By using the new database snapshot feature, SQL Server 2005 users can capture a persistent snapshot of a database at a point in time that can be accessed by multiple users. Database snapshots have many applications, including point-in-time database views for reporting purposes, viewing mirrored databases, and recovering from user errors. It is important for the DBA to understand the I/O requirements in terms of space and performance of this feature.

Storage Space: Snapshots take up disk space. However, compared to a full copy of the database, snapshots are generally more space efficient. A snapshot uses the sparse file technology that is provided only by the NTFS file system to maintain this efficiency.

There is one snapshot file for each data file in the source database. A page is copied to a snapshot file when it is updated in the source database. This mechanism is referred to as copy-on-write. The combination of the pages in the snapshot and those that have not changed in the source database gives a consistent view of the database at the time the snapshot was taken. Remember that if all pages in the source database are updated, the snapshot file can grow to a similar size as the source database file.

I/O Performance: I/O Performance is reduced when making a first-time update to a page on the source database because this leads to an extra write of the source page to the snapshot database. However, any subsequent changes to the copied page do not incur this extra write. Therefore, if a source database experiences only localized updates to certain pages, a snapshot database will have little effect on the I/O performance and storage efficiency. However, if there are many database snapshots of the same source database, a single modification to the source database can cause a ripple of several writes to each appropriate database snapshot.

Online DBCC CHECK: Database snapshots are created for online DBCC CHECK activity. The actual check is performed on the database snapshot to allow for concurrent access of the original database. This snapshot is created internally and stored on the same volume as the original database. The snapshot therefore requires physical disk space, possibly up to the same amount of space as the original table. Because the underlying database snapshot is created without user specification of physical location, and the volume might not have enough space, one of two things can happen:

  • ONLINE DBCC CHECK might revert to using DBCC CHECK WITH TABLOCK. This means that table locks will be taken on the original database instead of using the internal database snapshot. A short-term exclusive lock is also taken on the database. Running DBCC CHECK with TABLOCK reduces the concurrency that is available on the database during the DBCC CHECK activity.

  • DBCC CHECK might fail to run because it cannot create the internal database snapshot and cannot obtain the exclusive database lock.

Creating the internal database snapshot might cause problems if the volume is already full, or if the volume becomes full by creating the internal database snapshot. To avoid this, manually create a database snapshot of the source database, and then run DBCC CHECK on the snapshot. Manually creating the database snapshot lets the administrator create the snapshot in a volume with ample disk space. DBCC CHECK will run directly on the manually created snapshot, and will not create another internal database snapshot.  
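A minimal sketch of that approach follows; the database, logical file, and path names are placeholders, and one NAME/FILENAME pair is needed for each data file in the source database.

-- Create a snapshot of the source database on a volume with ample free space.
CREATE DATABASE AdventureWorks_CheckSnapshot
ON (NAME = AdventureWorks_Data,
    FILENAME = 'F:\Snapshots\AdventureWorks_Data.ss')
AS SNAPSHOT OF AdventureWorks;

-- Run the consistency check against the snapshot; no internal snapshot is created.
DBCC CHECKDB ('AdventureWorks_CheckSnapshot');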

Row-Level Versioning

SQL Server 2005 includes a new technology called row-level versioning to support better concurrent access. This technology is integrated in the following new features:

  • Snapshot Isolation

  • Multiple Active Result Sets (MARS)

  • Online Index Operations

    Note   Using row-level versioning requires storage space in tempdb and affects tempdb I/O performance during both read and update operations. This is a result of the extra I/O work that is required to maintain and retrieve old versions. For more information about tempdb I/O performance, see the related white paper.

When a row in a table or index is updated, the old row is copied to the version store (which exists in tempdb), and then is stamped with the transaction sequence number of that transaction performing the update. The new record holds a pointer to the old record in the version store. The old record might have a pointer to an even older record, which could point to a still older record, forming a version chain.

Versions can be created in the following scenarios:

  • In Snapshot Isolation for all updates and deletes.

  • When a trigger accesses modified rows (update/delete). This excludes instead-of triggers.

  • When a MARS session modifies a row (update/delete/insert) during pending reads.

  • For any online index build operations where there are concurrent DML operations on the same object.
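As a hedged illustration, the sketch below enables snapshot isolation on a placeholder database and then approximates how much tempdb space the version store is holding.

-- Allow snapshot isolation; row versions are kept in the tempdb version store.
ALTER DATABASE AdventureWorks SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Approximate the version store footprint in tempdb (pages are 8 KB).
SELECT SUM(version_store_reserved_page_count) * 8 AS version_store_kb
FROM sys.dm_db_file_space_usage;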

For more information, see SQL Server 2005 beta 2 Snapshot Isolation, Database Concurrency and Row Level Versioning in SQL Server 2005, and “Understanding Row Versioning-Based Isolation Levels” in SQL Server 2005 Books Online.

Data Partitioning

The partitioning feature that is available on SQL Server 2005 Enterprise Edition might have the greatest influence on the physical database design. Partitioning horizontally divides the data (either tables or indexes) into smaller segments, or partitions, which can be maintained and accessed independently from each other. This improves manageability for very large or growing tables compared to the older method of partitioned views in SQL Server 2000.

Partitioning requires partitions to be assigned to filegroups, so many of the maintenance and performance gains of filegroups can be taken advantage of with a partition scheme. For more information about maintenance and performance gains, see Storage Engine Capacity Planning Tips.
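A minimal partitioning sketch is shown below; it assumes four filegroups named FG1 through FG4 already exist in the database, and all object names are placeholders.

-- Range-partition on an order date; three boundary values yield four partitions.
CREATE PARTITION FUNCTION pfOrderDate (datetime)
AS RANGE RIGHT FOR VALUES ('20050101', '20060101', '20070101');

-- Map each partition to its own filegroup.
CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate TO (FG1, FG2, FG3, FG4);

-- Create the table on the partition scheme, partitioned by OrderDate.
CREATE TABLE Orders
(
    OrderID   int      NOT NULL,
    OrderDate datetime NOT NULL
)
ON psOrderDate (OrderDate);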

Manageability

Partitioning in SQL Server 2005 replaces the older Partitioned View feature in SQL Server 2000. It offers faster query compilation, explicit errors, partition elimination during query processing, and uniform indexing on a single table. The main advantage is its manageability. By splitting large tables into smaller chunks, each chunk can be configured and managed differently according to each partition’s I/O needs. However, the single table entity is still maintained for easy administration. Here are a few of the ways data partitioning makes database management easier while still providing availability of the table during partition maintenance operations:

  • Partitions can be merged and split easily when configured in a sliding-window scenario. When the partition is in a sliding window configuration, data can be managed easier by splitting, merging and switching the time-sliced partitions. For more information, see Partitioned Tables and Indexes in SQL Server 2005.

  • Data movement in and out of partitioned tables can be performed easily, with little loss of availability, by using the ALTER TABLE ... SWITCH statement (a sketch follows this list). There are three ways data chunks can be moved:

    • Switching a partition from one partitioned table to another

    • Switching a table into a partition as part of a partitioned table

    • Switching a partition out of a table to become its own separate table.

    These three methods give flexibility in managing partitioned tables in ways that allow manipulation of data while keeping the target table available. Because there is no physical data movement, only a metadata change, the partitions and tables involved in the switching are required to be homogeneous. They must have the same columns of the same data type, name, order, and collation, and must be on the same filegroup. There are additional constraints related to the alignment of the tables. For more information, see the Alignment and Storage Alignment section later in this paper.

    To determine the specifics of partition switching, see “Transferring Data Efficiently by Using Partition Switching” in SQL Server Books Online.

  • Backup and restore can be done on a partition granularity basis if the partitions reside on different filegroups. Alternatively, the database can be restored by using the new piecemeal restore feature. Piecemeal restore allows restoring of the backed up database in stages. For more information, see “Performing Piecemeal Restores” in SQL Server Books Online.
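The sketch below, referenced from the data-movement item above, reuses the placeholder objects from the earlier partitioning example; the staging and archive tables must match the partitioned table’s structure and reside on the filegroup of the affected partition.

-- Switch a loaded staging table in as partition 1 of the partitioned table (metadata-only).
ALTER TABLE OrdersStaging SWITCH TO Orders PARTITION 1;

-- Switch partition 1 out of the partitioned table into an empty archive table.
ALTER TABLE Orders SWITCH PARTITION 1 TO OrdersArchive;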

Alignment and Storage Alignment

When two objects are partitioned in the same way, they are described as being aligned. The two objects do not necessarily have to use the same partitioning function to be aligned. The two partition functions require the following:

  • The same data type for the arguments of the partition functions

  • The same number of partitions

  • The same boundary values for the partitions

When two objects are aligned as described above and also have the same partitioning scheme, they are considered storage aligned. Again, the two objects do not need to share the same partitioning scheme but must reside on the same filegroups in the same partitioned order. For more information, see Partitioned Tables and Indexes in SQL Server 2005.

Storage alignment can be further specified as physical or logical. Physical storage alignment occurs when the filegroups holding the partitions reside on the same specified physical disk drives. Logical storage alignment occurs when the filegroups reside on the same logical disk drive which can actually consist of multiple physical disks.

These two concepts are important because storage alignment can provide performance improvements for queries, and for the merge, split, and switch partitioning management features.

We recommend that the partitioned non-clustered index be logically storage-aligned to the base heap table or clustered index. There are two benefits to this design:

  • During index creation, it eases the I/O and memory load because sort tables are created one at a time instead of all at once. For non-aligned indexes the sort tables get created all at once. This can also affect any queries that require SQL Server to perform data sorting. For more information, see “Special Guidelines for Partitioned Indexes” in SQL Server Books Online.

  • During index or table maintenance, it allows partitions to be switched in and out of the table quickly while preserving the partition structure of both the table and index. If indexes are not also aligned, switching does not work.

In general, alignment and storage alignment are recommended for both the base table and any indexes. However, there are specific situations where a partitioned index might not benefit from being aligned with the base table:

  • The base table itself is not partitioned

  • The index key is not associated with the partitioning column of the base table.

  • The table is frequently involved in collocated joins with other tables by using different join columns. Collocated joins are joins between two aligned objects where the join predicate is on the partitioning column. The non-clustered indexes can be partitioned on different columns to participate in more collocated joins with other tables. For more information, see Partitioned Tables and Indexes in SQL Server 2005.

Aligned Partitioning and Disk Configurations

In the following example, there are two tables, T1 and T2. Each table has five partitions, and there are 20 available physical disks to be grouped in any logical volume configuration. The examples show four configurations for these two tables.

1. Simplest Configuration: Single Filegroup for Partitioned Tables

Figure 1 shows the five partitions of both tables using the same filegroup, FG1. The two tables are aligned; however, all the partitions reside on one filegroup. The filegroup resides on the logical volume U:\, which holds all 20 physical disks. The major limitation of this configuration is that the layout does not leverage data separation advantages, because each partitioned data chunk resides in the same filegroup. For example, if a certain partition requires read-only access, this cannot be set on the shared filegroup.

Although this configuration has many limitations, it is useful in a test environment, where the functionality is most important and further I/O tuning of the database is unnecessary.

Figure 1: Single Filegroup for Partitioned Tables

2. Segregated Configuration: Partitions on Separate Filegroups

Figure 2 shows that T1 and T2 are storage-aligned, with each of their five partitions assigned to separate filegroups. Partition 1 of T1 and T2 are both on FG1, partition 2 of T1 and T2 are both on FG2, and so on. Each of the logical volumes (I:\, J:\, K:\, L:\, and M:\) has four physical disks. The advantages of this configuration include the following:

  • Better separation of the data since each partition is in its own filegroup. This allows for specialized configuration of each partition.

  • Better I/O parallelism at the table level because the partitions are split into separate filegroups, and filegroups reside on different disks. However, if the five logical disks are striped with four underlying physical disks, the degree of I/O parallelism at the partition level is limited to a maximum of four disks.


    Figure 2: Segregated Disks

3. RAID10 Configuration

Figure 3 shows T1 and T2 as storage-aligned with all five partitions of both tables in separate filegroups. This layout is very similar to the previous design, except that all the filegroups exist on one logical disk consisting of all 20 physical disks with RAID10.

Note   RAID10 and other RAID levels are described in Appendix A.

In this scenario, the I/O parallelism can be used to its fullest by all partitions. The I/O workload is distributed among 20 physical spindles instead of four at the partition level.

Figure 3: Aligned Partitioned Tables on RAID10

4. Specialized RAID System Configuration

Figure 4 has a layout that is very similar to the RAID10 configuration in Figure 3 and has the same benefits. The only difference is that the partitions on FG4 and FG5 are read-only, compressed, and on another logical disk with RAID5. This configuration allows the heavily accessed read-write data to be on a high performance, highly reliable I/O system, while the read-only data is put on a less expensive RAID system that has sufficient performance for read-only data. This design allows for finer tuning of the needs of certain data to the appropriate I/O system.

Figure 4: Specialized RAID System Configuration

Physical Server System Design

This section describes the design choices to consider when building a database system. Information about disk and volume configurations, the available choices for system components, and how each component measures up to certain design criteria are included in this section.

The Considerations for Disk and Volume Configurations

This subsection describes the various disk and volume configurations to consider when building a database system.

Sector Size vs. Stripe Size

The sector size is the smallest physical storage unit on the disk. It is a fixed size set by the disk manufacturer, currently 512 bytes. The stripe size refers to the unit of data that is written to and accessed from a disk in a RAID system. This is a configurable value that is set when designing the storage array system. A smaller stripe size allows data to be distributed across more disks and increases I/O parallelism. Note that a single SQL Server extent (64 KB) is the practical lower limit for the stripe size. For the same data, a larger stripe size means the data can be stored on fewer disks, which decreases the I/O distribution and the degree of parallelism. We recommend a 64 KB or 256 KB stripe size for most workloads. When the workload includes table and index range scans on tables that are larger than 100 MB, a stripe size of 256 KB allows for more efficient read-ahead.

Physical Disks vs. Windows Basic Volumes

Physical Disks are the actual hardware disk drive units that are presented to Windows. The partitions and logical drives on these physical disks are described as Windows volumes. Existing Windows volumes can be extended to a larger size by extending the volume onto unallocated space on the same physical disk or to a different disk. When the volume is extended across multiple disks, it is described as a spanned volume. There are two types of Windows Volumes: basic and dynamic. Dynamic volumes provide more features, such as software RAID implementations, than basic volumes. To understand more about Windows volumes, see Basic Disks and Volumes and Dynamic Disks and Volumes.

NTFS vs. FAT vs. Raw Partitions

NTFS and FAT are both file systems offered by the Windows operating system. FAT is the older technology with more limited file and security features than NTFS. The NTFS file system has many unique built-in features that SQL Server 2005 features use. Features that require the NTFS file system include, but are not limited to, the following:

  • Database snapshots (which use NTFS sparse files)

  • Online DBCC check functions (which use NTFS sparse files and file streams)

  • Compressed files

  • Instant file initialization

  • Mount points

  • File ACLs and other security features

Because NTFS is the newer technology, it is not available with older operating systems such as Windows 98 or DOS mode in dual-booting scenarios. However, this is a small restriction because dual booting is not common in server scenarios, and SQL Server 2005 is not supported on the older operating system versions. FAT has a file size limit of 4 gigabytes; in contrast, NTFS has a very large file size limit of 16 exabytes.

A disk partition with no file system (an unformatted partition) is referred to as a raw partition. In some scenarios, raw partitions might give a slight performance increase compared to having a file system. However, Microsoft discourages raw partitions because they limit the common data access and recovery options that NTFS provides.

Volume Mount Points vs. Lettered Drives

Volume mount points, or mounted drives, are volumes that are attached to folder names on NTFS volumes and that are assigned a label or volume name instead of a drive letter. This feature gives mount points two advantages:

  • The mount points are not limited to 26 letters of the alphabet as lettered drives are.

  • The mounted drives are more protected against system changes that occur when adding or removing devices to the computer system.

Server Components and Design Criteria

This section describes various hardware components, and how the design criteria affect the choices of those components. For more information about the hardware components, see Appendix A.

Design Criteria

When considering the available design choices, there are some common criteria you should use to decide on the appropriate database design. There will be some trade-offs to evaluate during the design process, so it is a good idea to prioritize the design criteria. The following are a few design criteria considered in this paper:

  • Reliability/availability

  • Performance

  • Capacity or scalability

  • Manageability

  • Cost

Bus Bandwidth
Reliability/Availability

Larger bandwidth does not give a significant increase in reliability for smaller systems. However, for medium to larger-sized servers, larger bus bandwidth does improve the system’s reliability, especially with added multipathing software. The bus bandwidth’s reliability is improved through redundant paths in the system and by avoiding a single point of failure in hardware devices. There are many multipathing software solutions on the market, including the Microsoft MPIO driver.

Performance

Larger bus bandwidth is absolutely necessary for improved performance. This is especially true for systems that frequently use large block transfers and sequential I/O. A larger bus bandwidth is also necessary to support a large number of disks.

Also keep in mind that the disk is not the only user of bus bandwidth. For example, you must also account for network access.

In smaller servers that use mostly sequential I/O, PCI becomes a bottleneck with three disks. For a small server that has about eight disks performing mostly random I/O, PCI is sufficient. However, it is more common for PCI-X to be found on servers ranging from small to very large. PCI-E is now commonly found on newer desktops and might soon be more widely accepted on small servers.

Capacity

The capacity of bus bandwidth might be limited by the topology of the system. If the system uses direct attached disks, the number of slots limits the bus bandwidth capacity. However, for SAN or NAS systems, there is no physical limiting factor.

Manageability

There is no particular management required for system buses.

Cost

Larger and faster buses are typically found in the more expensive servers. There is often no way to increase the bus bandwidth capacity without replacing the server. However, the largest servers are more configurable. Check with server providers for specifications.

Interfaces
Reliability/Availability

For reliability, SCSI proves to be better suited because it supports forcing data to be written to disk. This support gives a significant improvement in recoverability. IDE and SATA, however, do not uniformly support commands to force data to disk. This can impact recoverability after a hardware, software, or power failure. All the types of interfaces can support hot-swapping to allow repairs while maintaining availability.

Performance

In general, all of the previously described interfaces support high transfer rates:

  • IDE has high transfer rates only if there is one drive attached per channel.

  • Most SATA interfaces are explicitly designed to support only one drive per channel; however, interface cards with 2 to 12 or more SATA channels are also available.

  • SCSI can have up to 15 drives per channel. However, overloading the channels increases the chance of hitting the transfer rate limit.

For individual I/O requests, it is recommended to use SATA or SCSI interfaces because IDE can handle only one outstanding I/O request per channel. This IDE limitation might be sufficient for small server loads. For larger server loads, SCSI and SATA with TCQ support multiple outstanding I/O requests, which works better with the performance capabilities of SQL Server.

Note   For correct SQL Server support, IDE and SATA installations must always ensure that the caching mechanisms properly support forcing data to disk. For more information, see the “SQL Server 2000 I/O Basics” paper.

Capacity

IDE and SATA drives tend to have greater capacities than SCSI drives.  However, this can be worked around by using more drives.  

Also note that, all else being equal, larger drives have longer average seek times.

Manageability

All three interfaces are fairly low in management costs. All interfaces have support for hot-swapping. The SCSI interface has an advantage over IDE and SATA because it is less restrictive in its physical cable length.

Cost

IDE and SATA drives tend to be cheaper than SCSI drives per GB.

Disk Topology (Direct Attach/SAN/NAS)
Reliability/Availability

Direct attached topology has a larger probability of cable failure because it has more cabling than a SAN or NAS topology. However, because the SAN and NAS topologies have more hardware and the hardware is more complex, it can lead to a greater probability of hardware failure (such as ones caused by ESD, power surges, or misuse).  This is more concerning because data loss is usually associated with hardware failures.

Overall, all three topologies have capabilities for redundancy in cabling and hardware, and so can be considered to have no notable reliability differences when all other SQL Server I/O requirements are upheld.

Performance

Direct attached disks often have higher maximum bandwidth. This can be very important for bottlenecked workloads.

Capacity

For SAN or NAS topologies, there are no limitations on the number of disks that can be accessible. However, with direct attached servers, the number of disks is limited by the number of slots in that system and the type of interface used.

Manageability

Manageability changes with the number of servers in the system. For a system with a smaller number of servers, direct attach is easier to manage. As the number of servers increase, it becomes harder to manage. SANs and NAS systems provide a data-consolidated solution that makes it easier to install additional servers and manage the system when there are many servers involved. SAN and NAS systems also make it easier to reallocate disk storage between servers.

Cost

Initial overhead costs are much lower for direct attached systems than for the networked disk topologies. For larger systems, initial costs tend to be lower with NAS compared to SAN, depending on the network infrastructure. This is because NAS systems usually already exist as part of the corporate network, whereas SANs are dedicated to a specific server solution. Adopting a SAN topology requires a full investment in a dedicated system; adopting a NAS system could simply mean adding more disks. Maintenance costs tend to be lower with NAS and SAN than with direct attached systems because of their manageability.

RAID

Choosing the correct RAID level for your business data solution is a key design decision. For more information about RAID levels, see Appendix A. Also, see Table 1A in Appendix A for summarized RAID properties and recommended uses.

Reliability/Availability

RAID systems are designed for reliability through redundancy in disk arrays. RAID1, RAID5, and RAID10 provide protection against disk failures. Only RAID1 and RAID10 protect against multi-disk failures; RAID5 protects against only a single-disk failure. RAID0 provides no protection against disk failures because it only performs data striping.

Often, RAID controllers support “hot-spare” disks for automatic substitution in case of drive failure. When a drive fails, the controller automatically rebuilds the data on the hot spare disk. This can improve the reliability of the system.

Performance
  • Compared to RAID0, all other RAID levels have lower write performance, all else being equal, because RAID0 does not have redundancy. Note that some implementations of RAID1 do not always incur the write penalty.

  • All other RAID levels can match RAID0 performance on read operations. The RAID overhead in performance is incurred during write operations.

  • RAID5 can have much lower write performance than any other configuration because it requires extra reading and writing activities for the parity blocks in addition to reading and writing the data.

  • RAID10 yields excellent read-write performance.

  • When comparing software RAID vs. hardware RAID, hardware RAID solutions typically outperform software RAID because hardware RAID does not share CPU processing time for I/O requests with the operating system or other applications.

  • Many RAID controllers provide an additional onboard memory cache. Having additional cache can improve on performance. However, battery backup and UPS support are required to avoid negative impact on reliability due to a power failure.

Capacity

In general, the smallest drive in an array system can be the limiting factor for the capacity.  Hence, it is strongly recommended (sometimes even required by manufacturers) to have identical drives.

RAID0 has the best storage capacity efficiency because it does not use any disk for data redundancy. When the disks are identical, it has 100% capacity efficiency.

RAID1 has 50% capacity efficiency because half of its disks are used for mirroring the data.

Recall that RAID5 stores distributed parity bits, and so its capacity efficiency can be approximately calculated as:

(total number of drives – 1) / (total number of drives)

This assumes that all drives are identical.

RAID10 has 50 percent capacity efficiency because it is mirrored.

Manageability

There is no notable impact with the manageability of the different RAID levels.

Cost

Here are some cost comparisons of RAID:

  • Costs for RAID controllers are higher than for simple controllers.

  • Cost of hardware RAID systems is greater than that of the software RAID systems, such as the one provided with Windows Server 2003.

  • Mirror-based RAID systems require more disks for redundancy than parity-based RAID systems to hold the same amount of data. Therefore, more hardware purchases are required for the disks. However, parity-based RAID systems require more expensive controllers than the mirror-based RAID systems.

  • The costs are higher for larger arrays.

Consider that these costs must be compared to the costs of data loss, data recovery, and availability interruptions in the situation that RAID is not used.

Disk Drives
Reliability/Availability

An important factor to consider when shopping for disk drives is reliability. In terms of reliability, disk drives have improved tremendously over the years.

The manufacturer’s specifications commonly rate a drive’s reliability by using an index called Mean Time Between Failures (MTBF). This index refers to the average amount of time before a second disk of the same type fails after the first failure, and it applies to drives that are repairable. This number is calculated by testing a sample population of manufactured disks over a short period of time. Typical MTBFs are 500,000 to 1,200,000 hours, or 57 to 138 years, respectively. These numbers seem to indicate high reliability. However, keep in mind that MTBF is a predictive measure of the reliability of a large population of drives, not a guarantee of reliability for any one particular drive.

Another measure of reliability is the Mean Time to Failure (MTTF). This is the average time to the first failure for non-repairable hardware. The typical values of MTTF are similar to those of MTBF.

However, a better indication of the hard drive’s reliability is the service life or the warranty period of the device.

The reliability of the disk array system should be considered rather than that of the individual disk drives. The probability of having a disk failure in the system increases as the number of disks increases. Assuming that the MTTF is a good enough estimate, to determine the system’s MTTF, use the following formula:

MTTF of the disk array = MTTF of a single drive / number of drives in the array

It is always a good idea to check the manufacturer’s specifications. Certain manufacturers specify "for desktop use only"; accordingly, these disks should not be used in servers.

In addition to the manufacturer’s specifications and warranty, disks usually come with environmental specifications about how they should operate for best reliability. For example, keep the system in an area with good ventilation and little traffic as heat and vibrations are the two leading causes for hard drive failures. A good indication of the amount of heat the unit produces is the power consumption. In general, the greater the power consumption, the greater the heat produced.

Performance

Disks with higher rotational speeds and lower seek times are typically better performers. However, to find the optimal performance for a particular company’s business, it is a good idea to look at that server’s type of workload. For servers that frequently perform sequential workloads, having a large disk cache is more important.

Important   Make sure that the cache is backed up by battery and UPS.

For random workloads, seek times and the number of drives are more important.

Capacity

Drives come in many sizes. Smaller drives are typically faster.

Cost

The cost of a drive typically increases with greater speed and capacity. However, seek time, MTBF, and branding can affect drive price.

Different Types of Workloads and Their I/O Requirements

This section describes how to characterize the workload and provides some examples of application workload and their configured systems. Trends and general recommendations are also included in this section.

Workload and I/O Activity Terminologies

Workloads and Storage Architecture

When looking at the types of storage architecture, two important aspects to consider are the type and size of the workload that the application puts on the server.

Type of Workload

The type of workload has a major impact on the I/O subsystem and is a very important consideration when designing the storage architecture. Two commonly known categories of workload are Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP).

Online Transaction Processing (OLTP)

OLTP is an application type that operates with a high level of create, read, update, and delete activity, and with a large number of user connections. OLTP production servers typically have heavy, continuous updates from many users issuing simple queries that generate random accesses to the disk.

Online Analytical Processing (OLAP)

OLAP is an application that operates with high levels of mostly read activity, with fewer user connections that make more complex and demanding queries. This technology is commonly used in businesses that require management-critical data that is accessed by fewer users for analytical investigation. Although OLAP is mostly involved with read activities, periodically there is write activity that has to finish in a timely manner. Decision Support System (DSS) is another term used for OLAP.

Workload Size

The size of the workload depends on the expected number of user connections, the rate of transactions, the work done by each transaction, and the size of the database. The relevant component of the work done by a transaction is how much data is read and written.

Types of I/O Activities

When describing an I/O subsystem, it is common to use the terms that are briefly described in this section.

Read and Write

Read and write operations differ in speed for different disk systems. A read operation generally refers to accessing the data from the hard drive and writing it to the buffer pool. The write operation generally refers to writing the data from the buffer pool onto the hard drive.

Random vs. Sequential

Random and sequential are two ways to access data on a disk. Random access refers to accessing the data that is scattered across the disk. Sequential refers to accessing the data in the order it is physically stored.

SQL Server Block Size

The SQL Server I/O block size refers to the data block size in which SQL Server issues I/O requests. The minimum block size is 8 KB, or one SQL Server page. Other common block sizes are 64 KB and 256 KB, or 8 and 32 SQL Server pages, respectively.

Characterizing Application Workloads

The disk subsystem necessary to support an application depends on the I/O requirements of that application. To analyze the requirements, the I/O characteristics for each type of workload should be classified in the following ways:

  • Number of random disk I/Os to the data files.

  • Number of sequential disk I/Os to the data files.

  • Amount of transaction log generated.

  • Rate of execution of the operations by users.

The total number of disk I/Os determines the number of individual disks that are required. To establish good performance throughput, it is crucial to determine the appropriate number of disks. When determining the number of disks, use this rule of thumb: a single server-class disk can handle around 100-150 random I/Os per second and 30-50 MB/sec of sequential I/O comfortably. While the fastest disks can handle more than this, the conservative number provides “cushion” for handling peak workloads.

Here is an example that determines the number of spindles for a certain I/O workload.

A small business firm requires 8 MB/sec of random disk I/O to handle its workload. Because SQL Server 2005 uses an 8 KB transfer size for random I/O operations, the I/O subsystem must be able to handle 1,000 random disk I/Os per second. If one disk drive can perform 150 I/Os per second, then seven disk drives are required to handle the workload. This calculated disk drive count is the minimum disk count before accounting for any overhead from a RAID system. The disk drive’s I/Os per second (IOPS) can be measured by sqlio.exe. For more information about sqlio.exe, see How to use the SQLIOStress utility to stress a disk subsystem such as SQL Server.

Roughly, the formula to approximate the number of spindles is as follows:

# disk drives = (bandwidth / transfer size) / IOPS per drive
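The arithmetic can be sketched in T-SQL; the bandwidth, transfer size, and per-drive IOPS below are the example values from this section, not measured figures.

-- 8 MB/sec of random I/O at an 8 KB transfer size, with 150 IOPS per drive.
DECLARE @bandwidth_kb_per_sec int, @transfer_size_kb int, @iops_per_drive int;
SET @bandwidth_kb_per_sec = 8192;
SET @transfer_size_kb = 8;
SET @iops_per_drive = 150;

-- (bandwidth / transfer size) gives the required I/Os per second; dividing by per-drive
-- IOPS and rounding up gives the minimum drive count (7 here), before RAID overhead.
SELECT CEILING((@bandwidth_kb_per_sec / @transfer_size_kb) / (@iops_per_drive * 1.0)) AS min_disk_drives;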

Next, it is important to determine that the bus system can handle the I/O throughput. The required bus bandwidth is different for different types of workloads:

  • In an OLTP application with random I/O, the number of disk drives is usually much higher than the number of buses. A 1 Gb fiber channel link can handle a maximum of about 90 MB/sec. Because a single disk can perform 150 I/Os per second and each I/O operation is 8 KB, each disk requires 1.2 MB/sec of bandwidth. Therefore, 75 disk drives can be placed on the channel (90 MB/sec divided by 1.2 MB/sec).

  • In an OLAP application with heavy sequential I/O, the ratio of the number of disks to the number of buses is lower. For example, assume a single disk drive can perform 40 MB/sec of sequential reads. A single fiber channel link would be saturated by only three disks. This scenario is where direct-attached SCSI would be a good option; a single U320 SCSI bus can feed seven to eight disk drives.

Depending on the amount of logging for the different types of workload, the logs should be among the first to be considered for placing on a separate bus from the data. Segregating the log on a separate bus distributes some of the bus load.

When considering the types of workloads, recall that some special operations in SQL Server 2005 also use the disk subsystem:

  • Checkpoints: Even if frequently updated pages are kept in the buffer pool, the pages need to be written to disk at least once per minute by the checkpoint process to ensure quick crash recovery and to allow for the log space to be reused.

  • Backup: Backing up a database performs many large sequential reads from the data and log files, and many large writes to the backup destination.

  • Bulk loads: Bulk loading performs many large sequential writes to the data files.

  • Index creation: During index creation, there are many large sequential writes to the data files. Depending on the type of operation, the options specified, and the database recovery model, there might also be:

    • Many random reads and writes in tempdb.

    • Writes to the log files.

    • Random reads and writes in the database if the entire sort result set does not fit in memory.

  • Shrink/Defrag: There are many random reads and writes during shrink and defrag in the data files, as well as writes to the log files.

  • Checkdb: Checkdb executes many sequential reads in the data files.

Examples of Workloads and Configurations

Assume that you have a small, light-weight OLTP application. Each transaction involves ten 8 KB random I/Os and 10 KB of sequential log writes. To support 100 transactions per second, the data disks have to support 1,000 random 8 KB I/Os per second, or approximately 8 MB/sec (1,000 × 8 KB), and the log disks would have to sustain 1 MB/sec (100 × 10 KB). This sums to 9 MB/sec of total bus bandwidth. Seven disks that can handle 150 I/Os per second could handle the random I/O workload for the data, a single disk could handle the generated log, and relatively little bus bandwidth is required. The minimum number of disk drives required can be calculated by using the equation given in the previous subsection. A reasonable server configuration would have 12 data disks and two log disks (both using RAID1), all on a single SCSI channel.

Another scenario involves a data warehouse that is used by a small number of users. Each query involves one hundred 64 KB random reads and little log activity. To support 25 users who all want one-second response time, the system would have to support 2500 I/Os per second (25 users × 100) or 160 MB/sec (2500 I/Os per second × 64 KB). A reasonable server configuration would have a dual-channel SCSI RAID controller on a PCI-X bus attached to a total of 28 disks using RAID5. Here, RAID5 is used because there is little write activity.

Now assume that you have a larger, heavyweight OLTP application where each transaction involves two hundred 8 KB random I/Os and 400 KB of sequential log writes. To support 100 transactions per second, the data disks would have to support 20,000 random I/Os per second (100 × 200), or 160 MB/sec, and the log would have to sustain 40 MB/sec. The bus would need to sustain a total of 200 MB/sec (160 MB/sec + 40 MB/sec). A reasonable server configuration would have the server connected to a SAN through three fiber channel adapters on at least two PCI-X buses. The SAN would contain 250 disks using RAID1 for the data files and four disks using RAID1 for the log files.
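The following Python sketch reproduces the arithmetic behind these three sample workloads; the per-transaction I/O profiles are taken from the examples above, and the function name is just for illustration.

def workload_bandwidth(tx_per_sec, random_ios_per_tx, io_kb, log_kb_per_tx):
    """Return (data MB/sec, log MB/sec, random IOPS) for a transaction-style workload."""
    iops = tx_per_sec * random_ios_per_tx
    data_mb_per_sec = iops * io_kb / 1000
    log_mb_per_sec = tx_per_sec * log_kb_per_tx / 1000
    return data_mb_per_sec, log_mb_per_sec, iops

# Light-weight OLTP: 100 tps, ten 8 KB random I/Os and 10 KB of log per transaction.
print(workload_bandwidth(100, 10, 8, 10))     # -> (8.0, 1.0, 1000): about 9 MB/sec of bus bandwidth

# Data warehouse: 25 one-second queries, each doing one hundred 64 KB random reads, little log.
print(workload_bandwidth(25, 100, 64, 0))     # -> (160.0, 0.0, 2500)

# Heavyweight OLTP: 100 tps, two hundred 8 KB random I/Os and 400 KB of log per transaction.
print(workload_bandwidth(100, 200, 8, 400))   # -> (160.0, 40.0, 20000): about 200 MB/sec total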

Conclusion

For more information:

https://www.microsoft.com/technet/prodtechnol/sql/default.mspx

Appendix A: The System Components

This appendix defines and briefly describes system components that are relevant to this paper.

Bus Bandwidth

There are two factors that govern bus bandwidth:  

  1. The bus type

  2. The number of buses in the server

Bus Types

Typically, there are three types of buses commonly used in the industry: PCI, PCI-X (for eXtended) and PCI-E (for Express).

PCI (Peripheral Component Interconnect): A computer bus that transfers data between the motherboard and peripheral devices in a computer. The PCI bus is clocked at 33.33 MHz with synchronous 32-bit transfers, peaking at 133 MB/sec.

PCI-X (Peripheral Component Interconnect – Extended): The protocol for PCI-X has a faster clock rate than PCI, clocking up to 133 MHz with 32- or 64-bit transfers peaking at 1066 MB/sec.

PCI-E (Peripheral Component Interconnect – Express): This serial bus form factor uses the same PCI concepts but with a different pin layout, resulting in smaller slot lengths. It supports one or more serial signaling lanes, each transmitting up to 250 MB/sec.
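The peak figures quoted above follow directly from clock rate and bus width; the Python sketch below simply restates that arithmetic (the x4 lane count is a hypothetical example, not something specified in this paper).

def parallel_bus_peak_mb(clock_mhz, width_bits):
    """Theoretical peak throughput of a parallel bus such as PCI or PCI-X."""
    return clock_mhz * width_bits / 8          # MB/sec

def pcie_peak_mb(lanes, per_lane_mb_per_sec=250):
    """PCI-E scales with lane count; each first-generation lane carries ~250 MB/sec."""
    return lanes * per_lane_mb_per_sec

print(parallel_bus_peak_mb(33.33, 32))   # PCI:   ~133 MB/sec
print(parallel_bus_peak_mb(133, 64))     # PCI-X: ~1064 MB/sec (the ~1066 MB/sec figure above)
print(pcie_peak_mb(4))                   # a hypothetical PCI-E x4 slot: 1000 MB/sec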

Number of Buses on Server

When estimating total bus bandwidth, note that the number of slots filled is not necessarily the same as the number of buses in the server; several slots can share a single bus, depending on the make and model of the server. For more information, consult the specifications for the server.

Disk | RAID | SAN Interfaces

Interfaces

IDE ATA (Integrated Drive Electronics Advanced Technology Attachment): IDE is a disk drive technology that has an integrated controller and is based on parallel data transfers. Data travels asynchronously from the drive to the data bus.

SATA (Serial Advanced Technology Attachment): SATA is a physical storage interface with serial data transfers. Data travels synchronously on a clock edge. Transfer rates on SATA start at 150 MB/sec.

The main design advantage over IDE is that SATA II supports command queuing, known as Tagged Command Queuing (TCQ). Command queuing is beneficial because it allows out-of-order servicing of I/O requests. TCQ is an intelligent mechanism built into the host adapter that reorders requests to minimize movement of the disk head assembly. The disk can take rotation and seek distances into account, service the commands in a more efficient order, and then return the data to the operating system in the requested order. Requests are serviced according to their tagged modes, listed below; a simplified reordering sketch follows the list:

  • Ordered: I/O commands are executed in the same order as the requested order.

  • Head of Queue: This tagged command is serviced immediately after the current I/O command.

  • Simple: Allows the hard disk to control the ordering for optimized I/O activity.
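As a rough illustration of why out-of-order servicing reduces head movement (this is a simplified elevator-style scan in Python, not the drive's actual firmware algorithm), consider the following sketch:

def elevator_order(current_lba, queued_lbas):
    """Service queued requests in one sweep of the head instead of in arrival order."""
    ahead = sorted(lba for lba in queued_lbas if lba >= current_lba)
    behind = sorted((lba for lba in queued_lbas if lba < current_lba), reverse=True)
    return ahead + behind

# Requests arrive out of order relative to the current head position (LBA 500).
print(elevator_order(500, [900, 120, 510, 880, 130]))
# -> [510, 880, 900, 130, 120]: far less head travel than servicing the arrival order,
#    while the results can still be returned to the OS in the originally requested order.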

Other advantages include SATA’s thin cable design and smaller form factor and length, which allow for better airflow and heat dissipation. Also, SATA cables can extend a longer distance (one meter) without data corruption, compared to the 40 cm limit of IDE.

SCSI (Small Computer System Interface): SCSI is a parallel interface used for attaching peripherals such as hard disks. In general, SCSI provides a faster data transmission rate of up to 320 MB/sec. There is another version of SCSI called Serial Attached SCSI (SAS). SAS combines the benefits of SCSI with SATA’s physical advantages listed above.

FC DAS (Fiber Channel Direct-Attached Storage): Fiber channel is a physical-layer protocol that enables serial, full-duplex communication between servers and high-performance storage systems at speeds of up to 2 Gbit/sec. Fiber channel storage in a DAS topology is directly attached to a server and is not openly accessible to other servers. FC DAS is not commonly used.

FC SAN (Fiber Channel Storage Area Network): Dedicated network that allows access between storage devices and servers on that network using the fiber channel technology (described above).

iSCSI (Internet Small Computer System Interface): An IP (Internet Protocol)-based storage network protocol used for linking data storage to servers. The iSCSI protocol transmits SCSI packets as IP packets. Because it uses IP, iSCSI is routable and can leverage any existing TCP/IP network. Two terms are important in the iSCSI industry:

  • iSCSI Target: refers to the storage device, that is, the actual physical disks.

  • iSCSI Initiator: refers to the client that issues I/O requests to a particular iSCSI Target.

There are hardware and software versions of both iSCSI Initiator and iSCSI Target. The Microsoft iSCSI Software Initiator Package includes both the iSCSI Initiator service and the iSCSI Initiator software driver.

RAID

RAID stands for Redundant Array of Inexpensive (or Independent) Disks. It is a collection of disk drives working together to optimize fault tolerance and performance. There are various RAID levels, but only the RAID levels significant to SQL Server are described here.

RAID0 (simple striping): Simplest configuration of disks that stripe the data. RAID0 does not provide any redundancy or fault tolerance. Data striping refers to sequentially writing data in a round-robin style up to a certain stripe size, a multiple of a disk sector (usually 512 bytes). Data striping yields good performance because multiple disks are concurrently servicing the I/O requests. The positive points for RAID0 are the cost, performance, and storage efficiency. The negative impact of no redundancy can outweigh its positive points.

RAID1 (simple mirroring): This configuration creates an exact copy, or mirror, of all the data on two or more disks. This RAID level gives good redundancy and fault tolerance, but poor storage efficiency. To take full advantage of RAID1 redundancy, it is recommended that you use independent disk controllers (referred to as duplexing or splitting). Duplexing or splitting removes single points of failure and allows multiple redundant paths.

RAID5 (striping with parity): RAID5 uses block-level striping in which parity is distributed among the disks. RAID5 is the most popular RAID level used in the industry. This RAID level gives fault tolerance and good storage efficiency. However, RAID5 imposes a significant penalty on write operations, because each write also requires the corresponding parity to be read, recalculated, and rewritten.

RAID10 (stripe of mirrors): RAID10 is essentially many sets of RAID1 (mirrored) drives in a RAID0 configuration. This configuration combines the best attributes of striping and mirroring: high performance and good fault tolerance. For these reasons, we recommend using this RAID level. However, the high performance and reliability come at the cost of storage capacity.

Note that in all RAID configurations, usable capacity is limited by the smallest drive in the array; each member disk contributes at most the capacity of the smallest drive.
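To make the storage-efficiency figures (and the note about the smallest drive) concrete, here is a minimal Python sketch; the drive count and sizes are arbitrary examples, and the simple formulas assume the classic layouts described above.

def raid_usable_gb(level, drive_sizes_gb):
    """Usable capacity for the RAID levels discussed here.
    Every member is treated as being only as large as the smallest drive."""
    n, size = len(drive_sizes_gb), min(drive_sizes_gb)
    if level == "RAID0":
        return n * size            # striping only: 100% efficiency
    if level == "RAID1":
        return size                # a mirror set stores one copy's worth of data
    if level == "RAID5":
        return (n - 1) * size      # one drive's worth of capacity goes to parity
    if level == "RAID10":
        return (n // 2) * size     # each mirrored pair contributes one drive
    raise ValueError("unsupported RAID level: " + level)

drives = [146] * 8                 # for example, eight 146 GB drives
for level in ("RAID0", "RAID1", "RAID5", "RAID10"):
    print(level, raid_usable_gb(level, drives), "GB usable")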

Table 1A: Summarized I/O Activity of RAID Levels and Their Recommended Use

Reliability
  • RAID0: Lowest. Lack of fault tolerance results in data loss.
  • RAID1: Very good. Even better with duplexing.
  • RAID5: Good. Can tolerate a single drive failure.
  • RAID10: Excellent.

Storage Efficiency
  • RAID0: 100%
  • RAID1: 50%
  • RAID5: >50% and <100% ((#drives − 1) / #drives)
  • RAID10: 50%

Random Read
  • RAID0: Excellent.
  • RAID1: Fair. Worst of the RAID levels, but better than a single drive.
  • RAID5: Excellent.
  • RAID10: Excellent.

Random Write
  • RAID0: Excellent.
  • RAID1: Fair. Worse than a single drive, but better than some RAID levels.
  • RAID5: Fair. Generally better with larger stripe sizes.
  • RAID10: Very good.

Sequential Read
  • RAID0: Excellent.
  • RAID1: Fair. Comparable to a single drive.
  • RAID5: Very good. Generally better with smaller stripe sizes.
  • RAID10: Excellent.

Sequential Write
  • RAID0: Excellent.
  • RAID1: Good. Better than other RAID levels.
  • RAID5: Fair.
  • RAID10: Very good.

Cost
  • RAID0: Lowest.
  • RAID1: Moderate. Relatively high cost due to redundant drives; however, no expensive controller is required.
  • RAID5: Moderate.
  • RAID10: High.

Recommended Use
  • RAID0: Good for non-critical data, for infrequently updated data that gets backed up regularly, or for any data requiring fast write performance at very low cost. Great for testing.
  • RAID1: Good for data that requires high fault tolerance at relatively low hardware cost (redundancy using parity requires more expensive hardware). Best for log files.
  • RAID5: Very good for read-only data.
  • RAID10: Good for data requiring high performance for both read and write and excellent reliability, while trading off storage efficiency and cost.

Hardware vs. Software RAID:

There are two ways RAID is implemented: through hardware and through software. The hardware implementation refers to the traditional method, in which a dedicated RAID controller manages the striping or mirroring of data across a collection of disk drives, as described previously in this paper. The software implementation refers to RAID functionality provided by the operating system.

Software RAID is usually the cheaper of the two solutions. However, there are a few drawbacks to using software RAID. The most significant drawback is that, because the RAID engine runs in kernel mode, it shares CPU time quanta with other kernel mode components and overlying applications. In contrast, the hardware RAID solution does not face this problem because the RAID firmware executes on a separate dedicated chip. This also allows for asymmetric multiprocessing between the system processor and the RAID controller.

Disk Controller Caching

Caches temporarily hold data that has been recently read, or sometimes pre-fetched data that is likely to be accessed in the near future, to improve the time it takes to read from or write to the disk.

There are two types of cache stores:

  1. Cache on the disk

  2. Cache stored on the disk controllers

    1. Disk controller on server

    2. Disk controller on the disk system such as a SAN

In this section, the focus is on cache stored on disk controllers.

Cache Size

Although disk performance is commonly attributed to the disk seek time and rotational speed, the amount of cache also plays a role. However, the performance attributed to cache size is only valid as long as the cache is not full. If the cache is full, there is a significant drop in performance. Ideally, the cache size is large enough so that it never fills for the particular I/O workload.

Disk controller caches on servers range in size up to hundreds of megabytes. The disk controllers on a SAN disk system can have caches of up to tens of gigabytes.

Battery Backup

Some higher-end disk controllers offer battery backup protection for the cache memory. Because cache is volatile, an unexpected power failure might result in loss of data that was written to cache but not yet written to disk. The battery backup unit provides temporary power to the cache store so that when the system recovers, the data in the cache is flushed to disk before the system attempts to access the data. For specific timing, see the disk controller manual.

SAN – Storage Area Network

A SAN is a centralized network of storage devices. SAN technology transfers raw disk blocks directly onto the network. A centralized storage network allows server processing power to stay focused on the business application while I/O processing is offloaded to the storage network. Other advantages of SAN architecture include the following:

  • Consolidation of resources and better scalability with growing business needs

  • Cost-effective management and operational efficiency, with possible cost savings

  • Increased availability of data resources

  • Storage virtualization, which allows a network of heterogeneous devices to appear homogeneous to any server connecting to the SAN

A useful advanced feature of SANs is remote disk mirroring and backup. Remote disk mirroring is a storage extension solution that keeps a copy of business-critical data at a geographically remote data center.

NAS – Network Attached Storage

A NAS device is a dedicated server that provides file sharing. The NAS device can exist anywhere in the network. The major difference between NAS and SAN is that NAS accesses data by file name and byte offsets, whereas SAN accesses data by raw data blocks. The consolidation advantages of NAS are very similar to the ones noted previously for SAN. However, some NAS systems cannot guarantee the write ordering and write-through required by SQL Server 2005. Systems that are Windows Hardware Quality Lab (WHQL) certified meet this requirement. For more information, see the KB article Description of support for network database files in SQL Server.

Disk Drives

Capacity

Capacity refers to the storage size of the disk drive. It is important to note that the marketed disk capacity differs from the capacity reported by the operating system. Hard disk manufacturers use decimal (metric) prefixes, in which one gigabyte is 10^9 bytes, whereas the operating system reports sizes in binary units, in which one gigabyte is 2^30 (1,073,741,824) bytes. This convention dates back to when 2^10 (1,024) bytes was called a kilobyte because it was a close enough value, and the labeling persisted as disk drive capacities grew to gigabyte and terabyte sizes. Therefore, a user will find that the size reported by the OS is smaller than the size advertised by the manufacturer.
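For example (the 250 GB drive is an arbitrary illustration), the gap between the advertised and reported size is easy to compute:

def reported_size_gb(advertised_gb):
    """Convert a manufacturer's decimal gigabytes into the binary gigabytes the OS reports."""
    return advertised_gb * 10**9 / 2**30

print(round(reported_size_gb(250), 1))   # a "250 GB" drive appears as roughly 232.8 GB in the OS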

Rotational Speed

The rotational speed is the speed at which the disk platters spin. The higher the rotational speed, the more data the drive can read or write in a fixed time. At high rotational speeds, the drives produce heat as a byproduct, which can negatively affect performance, so good ventilation around the drives is something to keep in mind. Most home computer hard drives now run at 7,200 rpm, and some server-class drives spin at 15,000 rpm or faster.

Seek Time

Seek time refers to the time it takes for the head of the drive to move to the correct location. Seek time is one of the largest factors in the performance of a drive. However, the absolute seek time depends on how far the head must travel from its position at the time of the read/write instruction, so the “average” seek time is the figure usually quoted. A typical average seek time for a hard disk is in the single-digit milliseconds.
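Seek time and rotational latency together set the random-I/O ceiling of a drive, which is where per-drive figures such as the 150 I/Os per second used earlier come from. The Python sketch below makes the usual back-of-the-envelope estimate; the sample seek times and transfer time are assumptions, not vendor specifications.

def estimated_random_iops(avg_seek_ms, rpm, transfer_ms=0.1):
    """Rough random IOPS for one drive: average seek + half a rotation + a small transfer."""
    rotational_latency_ms = 0.5 * 60000 / rpm     # average wait is half a revolution
    service_time_ms = avg_seek_ms + rotational_latency_ms + transfer_ms
    return 1000 / service_time_ms

print(round(estimated_random_iops(4.5, 15000)))   # ~150 I/Os per second for a 15,000 rpm drive
print(round(estimated_random_iops(8.5, 7200)))    # ~80 I/Os per second for a 7,200 rpm desktop drive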

Cache Size

Cache size is sometimes referred to as internal buffer size. The disk cache size is the size of volatile memory integrated on the disk drive which holds any recently requested, written, or pre-fetched data. Disk cache size ranges from 2 MB to 16 MB.

Interface

Of all the factors in buying hard disks, the interface might be the key factor in hard disk performance. While the time taken to access the data is important, the bulk of the time is spent on data transfer instead of moving the heads around. The different types of interfaces include SATA, IDE, SCSI, FC, and iSCSI. For more information, see the Interfaces section earlier in this paper.

Appendix B: SQL Server 2005 Disk Usage

This appendix lists the major disk usages for SQL Server 2005. This might not be an exhaustive list for some specialized systems.

OS/Base Software

Any server box requires hard disk space to store the operating system, SQL Server, and any other tool binaries and library files. The space required for a full SQL Server installation depends on the SKU, computer type (32-bit vs. 64-bit), add-on server tools, etc.

Tempdb Database

Tempdb files are the data and log files associated with the tempdb database. By default, the temporary data and log files are tempdb.mdf and templog.ldf, respectively. There is only one tempdb database per instance of SQL Server. The tempdb database is used to hold temporary user objects such as tables, stored procedures, variables, and cursors. It also holds work tables, versions of the tables for snapshot isolation, and temporary sorted rowsets when rebuilding indexes with SORT_IN_TEMPDB. The total size of tempdb can have an effect on the performance of your system.  For more information about tempdb, see SQL Server Books Online.

Model Database

The model database is used as a template on which all user-created databases are modeled. By default, the model database has a 3 MB data file and a 1 MB log file for most editions of SQL Server 2005. The size of the model database might increase if the user adds tables, stored procedures, functions, or other objects to it. It can also be increased manually by using the ALTER DATABASE command. For more information about the model database, see SQL Server Books Online.

Master Database

The master database holds all the system-level information for a SQL Server system. This includes system configurations, account information, and user database information. The initial size of the master database is approximately 6 MB for the data file and 1.25 MB for the log file. These files grow as the database system becomes more complex and contains more databases.

User Data Files

User data files store actual data for a user-created database. The data files are created by copying the model database.

User Log Files

User log files store the transaction log records for the specific database to which each log file belongs. By default, the log files at creation are modeled on the model database.

Backup Files

Backup files are created during backup. The location of the backup files is determined by the user at backup time. The size of a backup depends on the size of the database, the type of backup (full, differential, log, file, filegroup, and so on), and the amount of DML activity since the last backup (for differential backups). The size of a full backup (the largest type of backup) is at most the size of the database itself, because space not allocated in the database does not get backed up.

Others (not covered)

Other files or data that consume hard disk space in a SQL Server system might be Full-text Search catalogs, OLAP cubes, and others.
