Contents of a Run Book
Updated: November 12, 2002
A run book should contain all of the information you and your staff need to perform day-to-day operations and to respond to emergency situations. This information should include the following:
Resource information about the data center and its hardware and software
Process information, including step-by-step procedures for operational and emergency processes
The run book should contain all necessary information to enable a staff member to perform any process, from performing a backup to failing over to a remote site.
Resource Information
The run book should contain the following types of detailed resource information to help your staff perform routine operational tasks and respond quickly and efficiently to data center emergencies:
Contact information — Detailed information about each database administrator (DBA), the building facilities staff, utility companies, and all hardware and software vendors
Hardware components — Detailed information about hardware components of the data center
Software components — Detailed information about software components of the data center
Keeping this critical resource information current and readily available to your staff reduces downtime when disaster strikes.
Contact Information
Record detailed information about each individual or company that you or your staff may need to contact in an emergency. This contact information should include the following:
Contact information for each DBA at the primary site, including his or her role in the operational and disaster recovery process
Contact information for the building facilities staff, the power company, the phone company, and other applicable utility companies
Contact information for your remote site, if you have one, and for all DBAs at that site
Hardware, software, and service vendor support phone numbers, e-mail addresses, account numbers, and login and password information for related Web sites
Contact information for the people responsible for other applications on the server, including developers, analysts, testers, and managers affected by a change to the application, related systems, or processes
In addition, record any other contact information that might be useful in troubleshooting and repairing the data center, such as useful e-mail discussion lists and Web sites.
Hardware Components
Record detailed information about each hardware component in the data center, including the following:
Server hardware
Model and serial number
Brand and speed of the processor
Amount and configuration of memory
Version of the BIOS
Dates and version numbers of firmware
Network interface cards (NICs), including their vendors and model numbers
SCSI host adapters or Fibre Channel cards, including their vendors and model numbers
Local storage hardware
Type, size, and number of drives, including cache if any
Logical disk configuration
RAID levels
Disk controller information (including write cache settings)
Dates and versions of firmware for drives and controllers
Special options used, such as allocation units
Disk arrays and storage area networks
Vendor and model
Type, size, and number of drives, including cache if any, and controller to which the disk is connected
Logical disk configuration
RAID level
Number of controllers and number of channels
Disk controller information (including write cache settings)
Dates and versions of firmware for drives and controllers
Special options used, such as allocation units
In addition, record any other information about the data center hardware that might be useful in troubleshooting and repairing the data center. For example, record a map of the physical wiring of specific drives to specific array controllers.
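One way to keep the hardware inventory checkable is to store each server's entry as structured data and flag the fields that are still blank. The sketch below is illustrative only: the `ServerHardware` field names and the `missing_fields` helper are assumptions chosen to mirror the server hardware list above, and the sample values are made up.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ServerHardware:
    """One run book entry for a server; all field names are hypothetical."""
    model: str = ""
    serial_number: str = ""
    processor: str = ""          # brand and speed
    memory: str = ""             # amount and configuration
    bios_version: str = ""
    firmware: str = ""           # dates and version numbers
    nics: str = ""               # vendors and model numbers
    storage_adapters: str = ""   # SCSI host adapters or Fibre Channel cards


def missing_fields(entry: ServerHardware) -> List[str]:
    """Return the names of fields that are empty or whitespace only."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


# Sample entry with made-up values; four fields are intentionally left blank.
entry = ServerHardware(
    model="ExampleServer 100",
    serial_number="SN-0001",
    processor="2 x 1.4 GHz",
    memory="4 GB (4 x 1 GB)",
)
print(missing_fields(entry))
```

A check like this could run whenever the run book is reviewed, so that incomplete entries are found before an emergency rather than during one.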
Software Components
Record detailed information about each software component in the data center:
All software
Serial numbers and/or license keys
The network share location for all software installed on the server, including all service packs, hardware drivers, and hot fixes
The onsite and offsite location of all software CDs, including license keys and serial numbers
The location of the written documentation for all software
Windows 2000
Operating system version, with service pack level and hot fixes
Server name, IP address, and role in the domain
Customized settings, including terminal server and registry settings
Information on related systems, including contacts, configuration information, and documentation of data interfaces
Local administrator account name and password
MSCS
Cluster configuration, including all cluster IP addresses, cluster name, cluster nodes, and cluster resource groups
User accounts authorized to administer the cluster
Microsoft SQL Server
Installation information, including service pack levels, hot fixes, instance names, server collation, ports, pipes, configuration options, virtual IP name and address, database file locations, file groups, service logins and passwords, e-mail account, and enabled network protocols
Information about file shares used by the SQL Server and SQL Server Agent service accounts and the associated permissions on those shares
Database collations if different from the server collation
Server roles, database schemas, user accounts, permissions, database roles, custom error messages, and the location of scripts to recreate these objects
List of all automated SQL Server Agent jobs (specifically including all backup jobs), what they do, who is notified, the code for each job step, the time or times they run, and the location of scripts to recreate the jobs
List of all alerts, what they do, the associated error number or performance condition, who is notified, and the location of scripts to recreate the alerts
Linked server, remote server, replication, and log-shipping configuration information
Distributed database and distributed partition information, including information such as Data Dependent Routing Tables and distributed transaction marks
List and location of all DTS Packages, including associated login and password information
List, location, and purpose of all custom code that runs on the server, and the location of a backup copy of this code
Names and locations of client tools installed to connect to remote database connections (for example, to heterogeneous data sources), and necessary configuration and connection information
List of additional features in use and relevant configuration information, such as Extensible Markup Language (XML) support for Internet Information Services (IIS), Active Directory service support, and Data Source Names (DSNs)
Analysis Services
Data source and transfer information, including all associated jobs
Location and storage format of the Analysis Services repository
Analysis Services repository backup job information and storage location
Location of data files
Security architecture, including logins, database roles, and cube roles
In addition, record any other information about the software that might be useful in troubleshooting and repairing the data center. For example, record the staff members who are most familiar with custom applications.
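Some of the SQL Server installation details listed above can be read from the server instead of being typed by hand. The sketch below builds a single Transact-SQL query around the `SERVERPROPERTY` function. The Python helper name `build_inventory_query` and the selection of properties are assumptions for illustration, and the resulting query still has to be run with whatever client tool you use.

```python
# Property names passed to the Transact-SQL SERVERPROPERTY function.
PROPERTIES = [
    "ServerName",
    "InstanceName",
    "ProductVersion",
    "ProductLevel",   # reports the service pack level
    "Edition",
    "Collation",      # server collation
    "IsClustered",
]


def build_inventory_query(properties=PROPERTIES):
    """Return one SELECT that reports each property as a named column."""
    columns = [f"SERVERPROPERTY('{name}') AS [{name}]" for name in properties]
    return "SELECT\n    " + ",\n    ".join(columns) + ";"


print(build_inventory_query())
```

Saving the query output alongside the run book entry gives a dated snapshot that can be compared after a service pack or hot fix is applied.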
Procedural Information
Develop and document procedures for each operational and emergency task that you and your staff perform. Whenever possible, develop Transact-SQL scripts for each of these tasks and automate the execution of these scripts by using SQL Server jobs or DTS packages. The procedural information should include the detailed steps and scripts for performing both routine operational tasks and emergency tasks, using both SQL Server Enterprise Manager and Transact-SQL scripts.
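As an illustration of scripting a routine task, the sketch below generates `BACKUP DATABASE` and `BACKUP LOG` statements that could be saved as a script or used in a SQL Server Agent job step. The helper names, the `Sales` database, and the `E:\Backups` folder are hypothetical, and the `WITH INIT` option is only one possible choice; adapt it to your own backup strategy.

```python
from datetime import datetime


def quote_name(name):
    """Bracket-quote an identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def backup_statement(database, folder, when, log=False):
    """Build a T-SQL statement for a full database or transaction log backup.

    The file name carries a timestamp, so WITH INIT (overwrite the backup
    sets in the target file) only ever applies to a file from the same second.
    """
    kind = "LOG" if log else "DATABASE"
    suffix = "trn" if log else "bak"
    stamp = when.strftime("%Y%m%d_%H%M%S")
    # Double any single quote so the path is a valid T-SQL string literal.
    path = f"{folder}\\{database}_{stamp}.{suffix}".replace("'", "''")
    return f"BACKUP {kind} {quote_name(database)} TO DISK = N'{path}' WITH INIT;"


when = datetime(2002, 11, 12, 23, 0, 0)
print(backup_statement("Sales", r"E:\Backups", when))
print(backup_statement("Sales", r"E:\Backups", when, log=True))
```

Generating the statement from one function keeps the manual procedure and the automated job consistent, which is the point of recording both in the run book.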
Operational Procedures
The DBA staff performs many routine operational tasks. To avoid problems, your staff should perform these tasks by using the same procedures each time. Record step-by-step procedures for performing each of the following types of routine operational tasks:
Security tasks
Changing the domain user account and password used by SQL Server and SQL Server Agent
Creating new logins and database user accounts
Changing SQL Server user passwords
Performing standard and C2 security audits
Scripting login information
Scripting application roles and recording passwords
Scripting linked or remote servers
Restoring logins and database users to another SQL Server instance
System administration tasks
Starting and stopping the operating system
Starting and stopping SQL Server services
Changing SQL Server configuration settings
Setting database options
Applying SQL Server service packs
Changing the server name
Manually backing up a database
Manually backing up a transaction log
Monitoring tasks
Monitoring CPU usage
Monitoring disk activity
Monitoring memory usage
Viewing current locks
Viewing current activity
Viewing the last command batch for a specified connection
Viewing the data and log space information for a database
Viewing the oldest active transaction in the database
Viewing the procedure cache usage
Viewing general statistics about SQL Server activity and usage
Identifying and analyzing bottlenecks
Data collection tasks
Archiving system and application logs in the event viewer
Archiving SQL Server error logs and SQL Server Agent logs
Archiving SQL Server setup logs
Archiving the cluster log file
Archiving sqldiag.exe output
Capturing output from sysperfinfo and sysprocesses
Capturing output from MPS Report tool if available
Troubleshooting tasks
Testing TCP/IP sockets client connections
Testing named pipes connections
Troubleshooting deadlocks
Troubleshooting failover clustering
Troubleshooting replication
Troubleshooting log shipping
Troubleshooting MS DTC transactions
Troubleshooting orphan users
In addition, add step-by-step instructions for any other tasks that you and your staff perform regularly.
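For several of the monitoring tasks listed above, a run book entry can pair the task with a Transact-SQL command that performs it. The mapping below is a sketch: the commands are standard SQL Server system procedures and DBCC statements, but the choice of command for each task, the Python structure, and the `command_for` helper are assumptions for illustration.

```python
# Task descriptions are copied from the monitoring list; commands are one
# possible choice for each task. Placeholders in braces are filled in later.
MONITORING_COMMANDS = {
    "Viewing current locks": "EXEC sp_lock;",
    "Viewing current activity": "EXEC sp_who;",
    "Viewing the last command batch for a specified connection":
        "DBCC INPUTBUFFER ({spid:d});",
    "Viewing the data and log space information for a database":
        "USE [{database}]; EXEC sp_spaceused; DBCC SQLPERF (LOGSPACE);",
    "Viewing the oldest active transaction in the database":
        "DBCC OPENTRAN ('{database}');",
    "Viewing the procedure cache usage": "DBCC PROCCACHE;",
    "Viewing general statistics about SQL Server activity and usage":
        "EXEC sp_monitor;",
}


def command_for(task, **params):
    """Return the command for a task with its placeholders filled in.

    Values are substituted as-is, so they should come from trusted staff,
    not from untrusted input. The spid placeholder only accepts an integer.
    """
    return MONITORING_COMMANDS[task].format(**params)


print(command_for("Viewing current locks"))
print(command_for(
    "Viewing the last command batch for a specified connection", spid=52))
print(command_for(
    "Viewing the oldest active transaction in the database", database="Sales"))
```

The same pattern extends to the other task groups, provided each recorded command has been tested on your own servers first.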
Emergency Procedures
Record the appropriate response to each type of emergency that may affect the data center. Although the precise tasks vary depending upon the high availability solutions implemented, have a planned and tested response to each of the following types of emergencies:
Natural disasters
Power outages
Server failures
Hardware component failures
User database corruption
System database corruption
Application failures
Network failures
Web server or other necessary server failures
Depending upon the high availability solutions implemented for the data center, the detailed steps will include MSCS failover and failback steps, log-shipping role change steps, transactional replication role change steps, and database restoration steps. These procedures should document the process of determining when to initiate a failover or a role change and how affected users are notified. These procedures must include steps to verify the system's state before bringing a restored system or database online. They should also include escalation steps in case the first attempt to restore availability fails.
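The requirements in the preceding paragraph (notify affected users, verify the system's state before bringing anything online, and escalate when an attempt fails) can be outlined as a small control flow. The sketch below is a hypothetical outline, not a failover implementation: every name in it is a placeholder, and the recovery and verification callables stand in for your documented MSCS, log-shipping, replication, or restore procedures.

```python
def recover(attempts, verify, notify):
    """Try recovery steps in escalation order until one passes verification.

    attempts: list of (description, callable) pairs, in escalation order.
    verify:   callable returning True when the system state is acceptable.
    notify:   callable taking a message for affected users or on-call staff.
    Returns the description of the step that succeeded, or None.
    """
    for description, step in attempts:
        notify(f"Starting: {description}")
        try:
            step()
        except Exception as error:
            notify(f"Failed: {description} ({error})")
            continue  # escalate to the next documented step
        if verify():
            notify(f"Verified and online: {description}")
            return description
        notify(f"Verification failed after: {description}")
    notify("All documented steps exhausted; escalate to the next contact.")
    return None


# Stand-in steps for demonstration only.
messages = []
state = {"healthy": False}


def failover():
    raise RuntimeError("secondary node unavailable")


def restore():
    state["healthy"] = True


result = recover(
    [("cluster failover", failover), ("restore from backup", restore)],
    verify=lambda: state["healthy"],
    notify=messages.append,
)
print(result)
for message in messages:
    print(message)
```

Writing the procedure in this shape makes the escalation order explicit and ensures that no step is reported as complete until the verification check has passed.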