SQL Server 2000 Operations Guide: Service Management

01/28/2010

Service Management

Updated: October 26, 2001

Abstract

This chapter briefly presents the issues facing the database administrator (DBA) in creating a service level agreement (SLA). The process of negotiating an SLA is covered, as well as the elements that should be included in an effective SLA. The chapter discusses the implications for availability of long-running queries, presents blocking and deadlocking, and describes best practices that enable the DBA to maintain the appropriate service levels. After implementing the procedures described in this chapter, the DBA will be able to negotiate an appropriate SLA given the client's needs and the resources available, and will be able to provide that level of service.

Introduction

Providing a sustained, high level of service can be a challenge for database administrators (DBAs). With continuous upgrades to the operating system (OS), Microsoft® SQL Server™, the network, and applications, the possibilities for error are numerous. Yet the demand for continuous, flawless service is constant.

The service level agreement (SLA) is essential to the delivery of this service. Not only is the SLA the mechanism by which the customer clearly expresses the demand for a certain service level, it is also a tool that the DBA can use to control and improve the processes that traditionally interfere with the provision of that service. The DBA can use the SLA to justify requests for additional hardware, software, and staff and to regulate how customer demands and problems are handled.

This chapter covers the service issues that should be addressed in your SLA. The chapter provides guidance on the process of analyzing your production environment to determine what kind of requirements and restrictions you should negotiate for. It also details the techniques that you should employ to ensure that your service levels are met.

Design Considerations

The SLA described in this chapter is intended to slowly improve operating conditions over time, serving as a tool for reform. This approach was selected because of its success in numerous data centers in a wide variety of industries. The idea common to these success stories is that problems should be resolved and not just managed.

The tradeoff to this approach is the number of personnel that become involved in the provision of service. This approach involves a DBA, problem tracking personnel, development personnel (assigned to the resolution of critical bugs), test personnel (required to clear bugs), and a customer representative. If your organization cannot provide these resources for the application you must maintain, you may find that the SLA described in this chapter is not optimal for your situation.

One other factor should be considered in creating an SLA. You will probably want to write a general SLA that can be used with most of your customers, but that includes addenda, or exceptions, for special situations. The standard SLA can then be referred to as the baseline performance standard, so that the negotiations and drafting of agreements can focus on the factors that make service to a particular organization unique. For example, in some organizations, the accounting department will have special month-end, quarterly, and year-end requirements. And, in very large international organizations, subsidiaries may not have the same year end as the parent organization.

Resource Requirements

To implement the SLA described in this chapter, you will need the following software:

Trouble ticket system. This can be any organized method for issuing, assigning, routing, and resolving trouble tickets. While numerous third-party packages exist for the sole purpose of handling trouble tickets, a well-planned system based on e-mail can work just as well.
Bug tracking system. This system needs to be tightly integrated with the development environment and requires strict controls on how bugs are opened, assigned, and closed. Use of a third-party package is strongly recommended.

You will also need to assign people to the following roles in addition to the DBA role:

Help desk manager. Someone needs to be responsible for the high volume of user calls that your data center could receive if your application fails. Even if your end users do not have contact with you, the error messages in your application and system logs that result from service problems should be treated as critical requests that must be managed. The help desk manager tracks these requests, ensures that they are turned into trouble tickets, and follows the tickets through to closure.
Help desk personnel. The help desk personnel perform two functions: to open new trouble tickets and to work on trouble tickets assigned to them. The kind of work your help desk personnel perform on open trouble tickets depends on your environment. If your help desk personnel also perform operational tasks in your data center, you can expect them to troubleshoot and possibly solve the problems they are assigned. If your help desk personnel work in a call center, it may be more appropriate for them to just gather status on open issues.
Manager(s) responsible for the application. The Microsoft Solutions Framework (MSF) divides the responsibility for an application between the Product Manager (the person who defines the features of the application) and the Program Manager (the person who manages the overall development process). Your organization may not divide application responsibility in this way. The minimum that is required is a single person who can negotiate on behalf of the customer. This person must balance the needs of the customer with the resources available within the organization; ensure that trouble tickets are closed to the customer's satisfaction and that application bugs are fixed within a reasonable timeframe; and remove obstacles that stand in the way of these responsibilities being met.
Application developer(s). Each component of your application must have a development resource identified for it. If the original developer is not available, a suitable replacement must be found. The most critical component of an improvement-oriented SLA is the developer who will ultimately fix application bugs. This role cannot be omitted.
Application management. If your application was produced by a third party, your organization must provide in-house application management, with a support contract that allows the reporting of bugs to the third party and includes a resolution framework (hot fixes, future releases, and so forth).
Application tester(s). An independent source must test the fixes provided by development. If your application was developed in house, it is suggested that a test group separate from the development group provide this resource. If no test group exists, make sure to recruit testers with previous test experience. It is also recommended that your test resources have use of an automated test tool that can both simulate application load and perform application tasks from script. If your application was produced by a third party, it may be possible to rely on their test procedures for the verification of hot fixes and new releases.
Bug tool administrator. The tool that tracks the bugs assigned to development will require some basic administration. If a bug is improperly assigned, the developer should not be responsible for reassigning it. Thus, a resource will be required to ensure that this important tool is properly used. Often an application manager fills this role. A test resource can also fill this role; however, using a tester could remove one of the independent checks from the process (if the tester had to confirm that they did their own job).
Network administrator. A support resource is required for the resolution of network problems.
Monitoring personnel (optional). In addition to a help desk, your organization may also field an operational monitoring group. This group is primarily responsible for watching enterprise-monitoring tools for any sign of trouble. This group can also handle some of the tasks traditionally assigned to the help desk, such as completing trouble reports and managing resolutions.

Process Flowchart

Your SLA should define the process for logging, responding to, and resolving trouble tickets. The SLA should also define the interaction of the trouble ticket process with the development process through the bug tracking tool. To help all parties involved understand this complex set of processes, it is a good idea to include a process flowchart in your SLA. The process flowchart follows a trouble ticket from creation to resolution (see Figure 8.1).

Figure 8.1: Service Management Process Flowchart

Negotiating the SLA

This section describes the process of developing an SLA. The information in this section applies equally to the way that the baseline SLA is produced and how the addenda are produced for each customer. The process is one of clarifying expectations and deliverables. The type of SLA you are developing (baseline or customer-specific) affects the scope of the agreement and who is involved, but the approach is the same.

The large number of roles involved in the provision of service should suggest to you that negotiating the SLA is not an easy task. When you negotiate the SLA, you are negotiating the terms of a number of operational positions. The people who currently hold these positions are likely to be uncomfortable during the negotiation process. Be sensitive to their concerns.

To overcome the fears that surround the SLA negotiation, be sure to include all affected parties. Encourage participation in the crafting of the SLA by making it clear to all parties that the SLA will help to set proper expectations. Let them know that by participating, they can help to set a level of expectation that they can live with. Emphasize that everyone benefits from clear expectations. You might consider opening the negotiation with a review of some current operational pressures, showing all parties how improperly set expectations make these pressures worse.

After you have the participation of all parties, you may want each operational group to vote for a representative to the negotiations. This will allow you to introduce the negotiations to an all-inclusive group, but then work out the details of the SLA with a more manageable set of representatives. You will need representatives from the following groups:

Help desk
Network operations
SQL operations
Application development
Product owner (someone who can represent the customer and has budget authority)
Monitoring operations (if your organization fields one)

The group representatives or the group as a whole will need to work out a number of issues that comprise the SLA. These issues are listed in the following section. To move the negotiating group through the issues, you might consider providing an outline of the SLA issues at the start of negotiations. This outline could even include suggested text that would resolve the issue. However, if you present any prepared ideas to the negotiators, you must make it clear that they are only suggestions. Remember, the parties involved may already be uncomfortable and may not accept suggestions that appear to be decided in advance.

Elements of the SLA

The SLA should include the following sections:

Scope of agreement. Detail hardware, network, server, and application components that are covered by the agreement.
Levels of service offered. This is the most important section of the SLA. For a thorough review of the subject, see "Service Levels" later in this chapter. The levels of service should detail each operational state an application, service, or server can enter into, from fully operational to nonfunctional, the desired response, and the timeframe in which the response can be expected.
Escalations for failure to meet contracted levels. You must establish in advance an escalation procedure that the customer and help desk representatives can follow when service levels are not being met. In cases of poor service, the most efficient recourse for an aggrieved party to pursue is to demand a higher-than-agreed-to level of support for the problem area. This approach allows you to avoid bringing in an outside party when operational problems occur and instead requires that operational personnel do a lot more work as a result of missing a service level.
Mechanism for changing inadequate responses. After a given response is shown to be insufficient or obsolete, it is important that a new response be introduced to the SLA. By providing a way to easily update the SLA, you are ensuring that your SLA will be used for many subsequent generations of your application.
Description of the trouble ticket process. The SLA should provide an outline of how trouble tickets are to be handled by the support organizations.
Description of the bug tracking process. The SLA should outline the process for bug submission and resolution. The SLA should name the roles that are allowed to submit bugs, the time frames in which bugs must be resolved, and the escalation procedure to be used when bugs remain unresolved.
Description of the change management process. The SLA should outline the process that the development, test, and operations groups must follow to implement a change in the production environment. This process should protect production from untested code. It should also allow production to roll back changes that appear to be causing more harm than good. For more information, see Chapter 2, "Change, Configuration, and Release Management."
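The bug tracking element above names three things the SLA should pin down: who may submit bugs, how long a bug may stay open, and when escalation begins. The following Python sketch shows one way those rules might be recorded so that they can be checked mechanically. The role names, severity labels, and day counts are hypothetical placeholders, not values prescribed by this chapter; your negotiated SLA supplies the real ones.

```python
from datetime import date, timedelta

# Hypothetical values: the roles permitted to submit bugs and the
# resolution time frames would come from your negotiated SLA.
ALLOWED_SUBMITTERS = {"DBA", "Help desk manager", "Application tester"}
RESOLUTION_DAYS = {"critical": 2, "major": 10, "minor": 30}


def can_submit(role):
    """Return True if the SLA allows this role to submit bugs."""
    return role in ALLOWED_SUBMITTERS


def needs_escalation(opened, severity, today, resolved=False):
    """Return True if an unresolved bug has passed its resolution deadline."""
    if resolved:
        return False
    deadline = opened + timedelta(days=RESOLUTION_DAYS[severity])
    return today > deadline


# Example: a critical bug opened on October 1 is overdue on October 4.
overdue = needs_escalation(date(2001, 10, 1), "critical", date(2001, 10, 4))
print("Escalate:", overdue)
```

Keeping the rules in plain data structures like these makes it easier to update them when the SLA itself is revised.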

Delivery of Service

After the SLA is agreed to, each party involved in the SLA should prepare to meet the responsibilities that were established by the document. For operations, this might require putting into place an entirely new support structure.

The two support systems that provide the core of your SLA are trouble ticketing and bug tracking. Trouble ticketing and bug tracking are key to your organization's ability to deliver on the service levels specified by the SLA.

Trouble Ticketing

When a problem occurs in a well-run organization, it is reported by customers or operational staff. The problem is then recorded and tracked in a system that can manage trouble details as they emerge. When the problem is finally resolved, the resolution is also recorded by the system with the expectation that operational personnel will learn from the experience.

If your organization does not use a trouble ticketing system, you must first find out whether customers or operational staff are encouraged to report problems when they first encounter them. If it is easier for customers to leave your site once it fails and visit a competitor's site, you can be certain that this is exactly what will happen. Make sure that your organization encourages the reporting of problems. If you design your system so that customers benefit from pointing out problems, there is a greater chance that they will do so. Also, reward employees who quickly identify operational problems. If your organization does not reward this type of behavior, there is no reason to implement a trouble ticketing system.

After you have removed obstacles to trouble reporting, you can implement a system for issuing and tracking trouble tickets. You can conduct a preliminary test with an e-mail based system. The process would begin with a description of the trouble being e-mailed to the appropriate person. This person would add information to the e-mail chain or would forward the e-mail to the appropriate person. As long as a single group is responsible for issuing and tracking the e-mail chains, an e-mail-based trouble ticketing system can work.

It is recommended that a third-party trouble ticketing system be adopted. This type of system allows for the assignment of trouble tickets, a standard procedure for the closing of tickets, and reporting on ticket status and progress. A third-party system is the only way to manage the work of a large operational body.
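Whichever system you choose, the life cycle described in this section is the same: a problem is recorded, details are added as they emerge, the ticket is assigned, and the resolution is recorded at closure. The following Python sketch models that life cycle as a minimal record. It is an illustration only; the class name, field names, and status values are assumptions and do not correspond to any particular third-party product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class TroubleTicket:
    """Hypothetical ticket record; field names are illustrative."""
    ticket_id: int
    description: str
    status: str = "open"
    assignee: Optional[str] = None
    resolution: Optional[str] = None
    history: List[Tuple[datetime, str]] = field(default_factory=list)

    def _log(self, note):
        # Every change is time-stamped so the ticket can be reviewed later.
        self.history.append((datetime.now(), note))

    def assign(self, person):
        if self.status == "closed":
            raise ValueError("Cannot assign a closed ticket")
        self.assignee = person
        self.status = "assigned"
        self._log("Assigned to " + person)

    def add_detail(self, note):
        self._log(note)

    def close(self, resolution):
        # The resolution is mandatory so that staff can learn from it.
        if not resolution:
            raise ValueError("A resolution must be recorded before closing")
        self.resolution = resolution
        self.status = "closed"
        self._log("Closed: " + resolution)


ticket = TroubleTicket(1, "Order page times out")
ticket.assign("Help desk")
ticket.add_detail("Reproduced from two client machines")
ticket.close("Restarted the stalled job and confirmed with the customer")
print(ticket.status, len(ticket.history))
```

Requiring a resolution at closure reflects the expectation stated earlier that operational personnel will learn from each recorded problem.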

Bug Tracking

The best source of ongoing application improvement is the constant tracking and resolution of bugs. Your organization should formalize this important process by introducing bug tracking software (if it has not done so already). Bug tracking software allows application problems to be managed apart from operational problems. Although an organization may open a large number of trouble tickets a day, only one or two new bugs may be found. A separate bug tracking system helps prevent the application development group from being sidetracked with these operational issues.

Your bug tracking system cannot be based on informal e-mail exchanges. The information associated with the bug is far too critical for this. If the information is lost, it could require hours of independent testing by the developer assigned to fix the bug. Furthermore, your system may have to monitor bugs through numerous application releases and numerous development teams. As developers leave a project, the background and resolution of a given bug become invaluable to future developers.

Service Levels

No matter what the customer has asked for, there are actually only two types of service that you have the ability to provide: monitoring and response. Although you may want to improve application performance, and may have even considered offering that as a separate service, focusing on application performance is not beneficial until you have monitored the application and determined where performance needs to be improved. After you have monitored the application and have found an area to improve, any improvements implemented can be considered a response to poor performance. Thus even application tuning can be viewed as a process of monitoring and response.

Your service levels will detail the states you are watching for (monitoring) and the actions you will take when a certain state arises (response). You should also provide the timeframe associated with each monitoring and response pair. A service level can be simply defined as the response to a given operational state within a certain timeframe.
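The definition above (a response to a given operational state within a certain timeframe) maps naturally onto a three-field record. The following Python sketch shows one possible representation. The state names and timeframe names are taken from the lists later in this chapter; the response text is hypothetical and stands in for whatever your negotiations produce.

```python
from dataclasses import dataclass
from enum import Enum


class Timeframe(Enum):
    REAL_TIME = "real time"
    NEAR_REAL_TIME = "near real time"
    DELAYED = "delayed"


@dataclass(frozen=True)
class ServiceLevel:
    state: str            # operational state being monitored for
    response: str         # action taken when the state arises
    timeframe: Timeframe  # how quickly the response is delivered


# Hypothetical responses; real ones must be negotiated per application.
SERVICE_LEVELS = [
    ServiceLevel("Server down",
                 "Notify the DBA and begin recovery",
                 Timeframe.REAL_TIME),
    ServiceLevel("Single job/task/package failed",
                 "Open a trouble ticket and investigate the failure",
                 Timeframe.NEAR_REAL_TIME),
    ServiceLevel("Single negative indicator present",
                 "Record the indicator for the next operational review",
                 Timeframe.DELAYED),
]


def lookup(state):
    """Return the service level defined for an operational state."""
    for level in SERVICE_LEVELS:
        if level.state == state:
            return level
    raise KeyError("No service level defined for state: " + state)


level = lookup("Server down")
print(level.response, "-", level.timeframe.value)
```

Raising an error for an undefined state is a deliberate choice in this sketch: a state with no agreed response is a gap in the SLA that should be surfaced rather than silently ignored.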

Operational States

To limit the number of service levels that your SLA must detail, it is suggested that you define responses only for the following operational states:

Server down
Service down
Database down
Database object down
Multiple jobs/tasks/packages failed
Single job/task/package failed
Multiple transactions failed
Single transaction failed
Multiple negative indicators present
Single negative indicator present
No negative indicators present

The last three operational states could involve many possible responses. If you are supporting something more than a simple database application, it is recommended that the operational states involving negative indicators be expanded to include indicators for each element of your application. For a complex application, the last three operational states could be expanded to include:

Multiple negative indicators present at the following:
- User interface
- Point of back-end connectivity
- Middle-tier Component Object Model (COM) objects
- Database
- External processing partner
Single negative indicator present at the following:
- User interface
- Point of back-end connectivity
- Middle-tier COM objects
- Database
- External processing partner
No negative indicators present at the following:
- User Interface
- Point of back-end connectivity
- Middle-tier COM objects
- Database
- External processing partner
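The expanded states above amount to counting negative indicators for each application element and classifying the count as none, single, or multiple. The following Python sketch illustrates that classification. The component names come from the list above; the function name and the sample indicators are hypothetical.

```python
from collections import Counter

COMPONENTS = [
    "User interface",
    "Point of back-end connectivity",
    "Middle-tier COM objects",
    "Database",
    "External processing partner",
]


def classify_indicators(negative_indicators):
    """Map each component to its negative-indicator operational state.

    negative_indicators is a list of (component, description) pairs
    gathered by whatever monitoring you have in place.
    """
    counts = Counter(component for component, _ in negative_indicators)
    states = {}
    for component in COMPONENTS:
        n = counts[component]
        if n == 0:
            states[component] = "No negative indicators present"
        elif n == 1:
            states[component] = "Single negative indicator present"
        else:
            states[component] = "Multiple negative indicators present"
    return states


# Hypothetical observations from a monitoring pass.
observed = [
    ("Database", "Transaction log nearly full"),
    ("Database", "Long-running query detected"),
    ("User interface", "Slow page response"),
]
states = classify_indicators(observed)
for component in COMPONENTS:
    print(component + ": " + states[component])
```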

Timeframes

You must factor a timeframe into the delivery of each of your service levels. Unless your data center is staffed 24 hours a day, seven days a week, the supported timeframes can be simply defined. The timeframes that are most often associated with operational response are as follows:

Real time
Near real time
Delayed

Responses

The most complex element of your service levels will be the responses that you define for each level. Each response needs to be specifically designed to fit your application and your operational environment. All parties involved in the SLA process should work together to determine the appropriate response to each operational state. A detailed discussion of this process is beyond the scope of this chapter.

However, some guidance can be provided for the response creation process. First, do not abandon responses that you think are valid simply because the customer representative does not require them at present. Any responses that you think are valid that are not included in the final service levels should be included in your SLA as "additional responses" or "emergency responses". Your goal should be to define the full range of responses that might be needed at some point.

By defining many responses and creating a support structure that can handle each type of response, you will be able to easily move service levels from one type of response to another when the customer demands it. This is important, because the customer representative you negotiate with may miscalculate the importance of some situation prior to the actual implementation of the SLA. Soon after the SLA has been signed and your support structure has been created, the customer representative may then pressure you to change the way you are dealing with the situation. When additional responses are defined at the outset, both you and the customer representative have something that you can refer to.

Another thing to consider is the full set of roles that could be involved in the response to an operational state. It is a good idea to define responses for all the parties that may be involved in solving a problem. As a problem continues, increasingly valuable support resources should be introduced. The following list provides an example of the roles that might be involved in a problem state and the order in which they should be introduced:

Monitoring personnel. Some of the responses that monitoring personnel should be empowered to engage in are discussed in Chapter 5, "Monitoring and Control." The responses described allow for further investigation of problem states, but are limited due to the skill sets of the operational personnel.
DBA. Your role should be involved in all service-affecting states.
Network administrator. The network administration team should be involved in the response to any operational state that involves the network.
Development team. As required, the development team needs to perform troubleshooting, bug identification, and bug resolution.
Personnel provided by support contract. Frequently the servers in your data center will be covered by a support contract that entitles you to the use of highly skilled support personnel. Be sure to include the use of this role in your SLA.
Outside development personnel. This role may rarely be employed by your organization as a response to a problem state; however, in some cases using this role may be the only solution. If the original development team is no longer available and a bug fix is required, contracting support from outside developers may be the appropriate response. Consider only reputable organizations, such as Microsoft Consulting Services (MCS), for this role.
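The ordered list above can be expressed as an escalation schedule in which each role joins the response once a problem has remained open for a certain time. The following Python sketch is one way to do this. The minute thresholds are entirely hypothetical; this chapter specifies only the order in which roles are introduced, not when.

```python
# (minutes since detection, role). The order follows the list above;
# the thresholds are hypothetical and must come from your own SLA.
ESCALATION_ORDER = [
    (0, "Monitoring personnel"),
    (15, "DBA"),
    (30, "Network administrator"),
    (60, "Development team"),
    (120, "Personnel provided by support contract"),
    (240, "Outside development personnel"),
]


def roles_engaged(minutes_open, involves_network=True):
    """Return the roles that should be working on a problem by now."""
    roles = []
    for threshold, role in ESCALATION_ORDER:
        if minutes_open < threshold:
            break
        # The network team responds only to states that involve the network.
        if role == "Network administrator" and not involves_network:
            continue
        roles.append(role)
    return roles


print(roles_engaged(45))
print(roles_engaged(45, involves_network=False))
```

A fixed time-based schedule is a simplification. In practice you may also want to bring in a role immediately when the operational state is severe enough, rather than waiting for a threshold to pass.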

Maintaining Service Levels

The following suggestions are intended to help you avoid problem states in your SQL Server environment:

Limit the parts of the application that you cannot control, such as ad hoc reporting and individual queries submitted by users.
Pay close attention to long-running queries. These have greater impact on your service than any other single element.
Monitor blocking and deadlocking.
Employ high availability technologies such as clustering and load balancing.
Remove any single point of failure that you can find. Have redundant resources available at each point in the application path (for example, provide hot backup servers and a hot backup site if possible).
Have an image of each logical drive in your data center backed up to tape. Document how much data would be lost if those original drive images were used.
Regularly back up every database that you use (including system databases). Be extremely organized in storing these backups.
Have a disaster recovery plan and practice implementing it if possible.
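The suggestion to monitor blocking and deadlocking can be illustrated with a small sketch. It assumes that you have already collected, by whatever means, a snapshot showing which session is waiting on which. The sample data below is hypothetical and is modeled on the spid and blocked columns of the sysprocesses system table, where a blocked value of 0 means that the session is not waiting. The Python functions identify head blockers (sessions that block others but are not themselves blocked) and any circular wait, which is the pattern that characterizes a deadlock.

```python
def head_blockers(blocked_by):
    """Return sessions that block others but are not themselves blocked.

    blocked_by maps a session id to the id of the session blocking it,
    or to 0 if the session is not waiting on anyone.
    """
    waiting = {s for s, b in blocked_by.items() if b != 0}
    blockers = {b for b in blocked_by.values() if b != 0}
    return sorted(blockers - waiting)


def find_wait_cycle(blocked_by):
    """Return the sessions that form a circular wait, or [] if none."""
    for start in blocked_by:
        seen = []
        current = start
        # Follow the chain of blockers until it ends or repeats.
        while current in blocked_by and blocked_by[current] != 0:
            if current in seen:
                return seen[seen.index(current):]
            seen.append(current)
            current = blocked_by[current]
    return []


# Hypothetical snapshot: 53 waits on 52, which waits on 51.
snapshot = {51: 0, 52: 51, 53: 52, 54: 0}
print("Head blockers:", head_blockers(snapshot))
print("Wait cycle:", find_wait_cycle(snapshot))
```

In a blocking chain, the head blocker is usually the first session to investigate, because resolving it releases every session queued behind it. SQL Server normally detects and breaks deadlocks on its own by ending one of the participants, so a snapshot like this one is more likely to reveal long blocking chains than a live circular wait.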

Summary

The goal of your SLA should be the creation of a responsive support organization. The members of this organization should continually meet the expectations set by the SLA, and they should be provided tools that enable their participation in the support structure that has been defined.

The SLA itself should be a living document that is easily modified as the support infrastructure matures. All parties should embrace the SLA negotiation process and welcome the results. The benefits provided by clear expectations and standardized responses should be clear to all.

To create this environment, you should expect to do a considerable amount of work. You will need to bring parties together, work through complex issues, and address the concerns of those involved. But the results will be worth the effort.