Published: April 25, 2008 | Updated: October 10, 2008
Figure 5. Implementation
Implementing reliability involves the following activities:
- Developing various plans: availability, capacity, data security, disaster recovery, monitoring
- Reviewing and adjusting the plans for suitability before approving them
These reliability plans address the traditional objectives of availability, disaster recovery, capacity, data integrity, and monitoring functions. They can exist separately or be combined, depending on specific organizational requirements and scale. For larger, more complex organizations, it might be appropriate to retain some or all of the traditional plans individually. However, they should be managed collectively so as to take advantage of common objectives and technical solutions. This strategy helps the organization achieve higher reliability more cost-effectively.
IT management uses the requirements from the planning process to build corresponding plans that will allow their departments to meet or exceed delivery expectations. Activities and tasks performed during this process can include:
- Analyzing existing infrastructure and how the new service will affect it.
- Evaluating new technologies that can help IT achieve the desired outcomes.
- Validating that the plans meet delivery expectations and align to infrastructure standards.
- Adapting plans as business needs change.
The following table describes these implementation activities in more detail.
Table 5. Activities and Considerations for Implementation
Develop availability plan
The availability plan describes the measures that ensure high availability for the service. It addresses the hardware, software, people, and processes related to the service.
Key questions:
- What are the key components that make up this service?
- Are there any single points of failure in the application delivery architecture? Can we mitigate these?
- How has this service performed in the past, either in production or in a pilot or lab environment? What have we learned from this, and how will it affect our availability planning?
- What are the specific recoverability targets for this service? Are they achievable, or do we need to evaluate new or alternate technologies such as virtualization or clustering?
- Do we need vendors or suppliers to deliver components of this service? Are they able to commit to OLAs?
- Does trend data highlight inherent resilience problems with this service or its components?
Inputs:
- Service map
- Incident and Problem Management trend data for analysis
- Technical design and architecture documentation
- Component impact failure analysis
- Existing process documentation
- Incident lifecycle targets
- Availability requirements and targets from SLAs, OLAs, and underpinning contracts (UCs)
Outputs:
- Availability plan
- Technical design recommendations or updates for both availability and recoverability
- Updated incident lifecycle targets
- Skills and resourcing recommendations
- Updated OLAs and UCs
- Failover, backup, and configuration recommendations
- Proactive maintenance recommendations
Best practices:
- Design services for high availability. Proactive design is more cost-effective and efficient than retrofitting. Careful use of redundancy allows a service to tolerate failures of individual components.
- Identify, document, and schedule regular preventative maintenance tasks. Manage these tasks to ensure that they are performed to agreed-upon specifications.
- Continually review the availability plan, requirements, and performance. This ensures ongoing alignment between IT and the business and addresses changing technologies and business requirements.
- Proactively investigate technology improvements and developments, such as virtualization, to determine whether they can be used to increase availability or to reduce risk of individual component failure.
- Identify and manage availability risks as a regular activity. Use the outcomes of this activity to identify the priorities and business benefits of improvement activities.
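The effect of redundancy on a single point of failure can be estimated with simple arithmetic. The following is a minimal sketch, assuming component failures are independent and using hypothetical availability figures; the function names are illustrative and do not come from any particular tool.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain of components that must all be working."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of `copies` identical, independent components
    when any one of them is enough to keep the service running."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures, for illustration only.
web, app, db = 0.99, 0.99, 0.99

# Every tier is a single point of failure.
single = series_availability([web, app, db])

# Same service, but the database tier is duplicated.
with_redundant_db = series_availability(
    [web, app, redundant_availability(db, copies=2)]
)

print(f"no redundancy:      {single:.4f}")
print(f"redundant database: {with_redundant_db:.4f}")
```

Real components rarely fail independently (shared power, network, or software faults), so figures like these are best treated as an upper bound when comparing design options.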
Develop capacity plan
The capacity plan outlines the strategy for assessing overall service and component performance and uses this information to develop the acquisition, configuration, and upgrade plans. It provides management with a clear statement of resource and service capacity; an assessment of current capacities; a list of resources to be upgraded or acquired; and a projection of future capacity requirements.
Capacity planning can be broken down into three sub-processes:
- Business capacity planning, which looks at the business requirements
- Service capacity planning, which looks at the end-to-end service capacity
- Component capacity, which looks at the individual components that make up the service
Key questions:
- How do the dependent services affect capacity?
- What are the future business and IT plans for growth, mergers, and acquisitions that may affect the capacity requirements for this service?
- Are there business peaks or other time-based variations in demand that can affect the capacity requirements? How can we maximize use of quiet times?
- What historical trend data can we analyze to evaluate current capacity and how can we use this data to model capacity against various scenarios?
- How can we meet the capacity predictions of the organization in the simplest, most cost-effective way?
Inputs:
- Business plans and forecasts
- IT strategy and growth plans
- Usage trends
- Budget guidelines
- Service catalog
- Anticipated IT deployment and/or release plans
- Internal marketing or awareness campaigns that may drive an increase in demand for this service
- Workload information
- Capacity-related incidents, problems, and emergency change records
Outputs:
- Updated capacity plan
- Request for Change (RFC) to update infrastructure or service components (see Change and Configuration SMF for more information)
- Component and service monitoring thresholds and alerts
Best practices:
- Perform capacity planning during the design phase of a service or solution. Adding to or changing the service and components after the fact is costly and time consuming.
- A close relationship with the business will help to identify factors that will affect capacity, such as a change in business processes or usage patterns. An ongoing and regular review of capacity requirements is a good practice.
- Capacity can be added using a number of strategies:
  - Incremental versus replacement
  - Scaling up versus scaling out
  - New technology
  - Parallel versus hub-and-spoke
- Wherever possible, establish automation of monitoring and alerts on preset thresholds to reduce manual effort and error.
- Because historical performance in different scenarios provides good insight into the service behavior, it is useful to retain and store this type of information for analysis and trending predictions.
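Retained usage history can feed a simple trend projection. The sketch below is one possible approach, not a prescribed method: it fits a straight line to evenly spaced historical readings and estimates how many periods remain before a preset threshold is reached. The sample readings, the 80 percent threshold, and the function names are assumptions made for illustration.

```python
import math


def fit_linear_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...
    Returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_threshold(values, threshold):
    """Whole periods after the last observation until the fitted trend
    reaches `threshold`. Returns None if the trend is flat or falling."""
    slope, intercept = fit_linear_trend(values)
    if slope <= 0:
        return None
    last_x = len(values) - 1
    crossing_x = (threshold - intercept) / slope
    return max(0, math.ceil(crossing_x - last_x))


# Hypothetical monthly disk utilization, in percent.
monthly_disk_used_pct = [52, 55, 58, 61, 64, 67]
months_left = periods_until_threshold(monthly_disk_used_pct, threshold=80)
print(f"months until the 80% alert threshold: {months_left}")
```

A linear fit ignores business peaks and seasonal variation, so a projection like this is a starting point for modeling scenarios rather than a forecast to rely on by itself.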
Develop data security plan
The data security plan describes how the service will be brought to acceptable levels of security. It details existing security threats and how implementing security standards will mitigate those threats. The plan must address the three goals of information security:
- Data confidentiality. No one should be able to view an organization’s data without authorization.
- Data integrity. All authorized users should feel confident that the data presented to them is accurate and not improperly modified.
- Data availability. Authorized users should be able to access the data they need when they need it.
Key questions:
- How does this service use data or facilitate information transfer?
- How critical is this information to the organization, and does it require regulatory oversight?
- Do we have an internal policy that defines the information type and how this information should be handled?
- Who needs access to this information, and where is it stored? Do we need to make arrangements for contingent staff?
- How long should we retain the data, and how do we best dispose of it when we no longer need it?
- What controls and monitoring should we use to ensure that this data is not compromised? If it does get compromised, how will we know?
- What reporting do we need to provide, and to whom? Will we eventually need to make the reports available to internal/external auditors?
Inputs:
- Information classification policy
- Information handling guidelines
- Regulatory or legal requirements related to secure information handling
- Technical and security design documentation
Outputs:
- Information management plan that includes processes for internal staff and external or contract staff
- Recommended security controls to protect the information
- RFC to implement security controls and/or make design changes
- Audit and monitoring recommendations
- User awareness communication plan
Best practices:
- Generate an overall security policy—supported and enforced by the business—that addresses the requirements, handling and escalation procedures, and user responsibilities relating to corporate data and security incidents. This policy should be regularly communicated to staff through an ongoing awareness campaign.
- Implement technical measures to prevent malicious and/or accidental security breaches. The rule of least privilege provides the most control.
- Wherever possible, use tools and automation to help manage the security of the data. There are many technical solutions that provide this capability.
- Perform regular security audits and penetration tests to ensure that any weaknesses are identified. Factor data security considerations into any future service changes, RFCs, and technical support responsibilities.
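The rule of least privilege can be illustrated as a deny-by-default access check that also records every decision for later audit. This is only a sketch under stated assumptions: the classification levels, role names, and grants below are hypothetical and would in practice come from the organization's information classification policy.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Hypothetical information classification levels."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


# Hypothetical grants: each role lists only the actions it needs.
GRANTS = {
    "service_desk": {
        Classification.PUBLIC: {"read"},
        Classification.INTERNAL: {"read"},
    },
    "payroll_admin": {
        Classification.PUBLIC: {"read"},
        Classification.INTERNAL: {"read", "write"},
        Classification.CONFIDENTIAL: {"read", "write"},
    },
}

# Every decision is recorded so that audits can review access attempts.
audit_log = []


def check_access(role, action, classification):
    """Deny by default: allow only actions explicitly granted to the role."""
    allowed = action in GRANTS.get(role, {}).get(classification, set())
    audit_log.append((role, action, classification.name, allowed))
    return allowed


print(check_access("payroll_admin", "write", Classification.CONFIDENTIAL))
print(check_access("service_desk", "read", Classification.CONFIDENTIAL))
print(check_access("contractor", "read", Classification.PUBLIC))
```

Unknown roles and ungranted actions are refused without any special-case code, which is the property that makes least privilege easier to reason about than a list of exceptions.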
Develop disaster recovery plan
The disaster recovery plan (DRP) is a documented and tested plan of the actions and activities that IT will follow in the event of a disaster. The purpose of this plan is to ensure that critical IT services are recovered within required time frames.
Key questions:
- What does the organization’s business continuity plan describe, and does this (new) service meet the definition of a business-critical system?
- Can the disaster recovery plan for this service be attached to the organization-wide disaster recovery plan, and if so, how does this affect the broader plan?
- What is the most appropriate yet cost-effective recovery scenario that still mitigates the business risk?
  - Hot standby. Dedicated computer equipment mirroring critical business systems, ready to take over immediately with no loss of data
  - Warm standby. A location with suitable computer equipment ready to recover service
  - Cold standby capability. A setup that would need to be provisioned in the event of a disaster
- Do we have to test this scenario regularly?
- What are our legal or regulatory reporting obligations?
- Do we have a tried and tested recovery plan?
- Are our users and staff aware of our disaster recovery plans and do we need to communicate them regularly?
Inputs:
- Regulatory and statutory requirements
- Existing continuity and disaster recovery plans from the business and IT
- Business impact analysis (BIA)
Outputs:
- Updated disaster recovery plan
- Communications plan
- Updated OLAs and UCs with vendors and suppliers
- Emergency response plan
- IT disaster recovery test plan
Best practices:
- Ensure that the IT Disaster Recovery plan aligns with and supports the Business Continuity plan. The business knows which of its processes and functions are the most critical. Because of high costs and complexity, only these critical processes are likely to warrant a hot or warm recovery option.
- Make risk management an ongoing activity. Risk identification and mitigation strategies are important activities that ultimately reduce exposure and service continuity planning costs. For more information on risk management, see the Governance, Risk, and Compliance SMF.
- Ensure that IT and business users can work remotely, and incorporate this capability into the daily business—this practice will promote greater flexibility under disaster conditions.
- Consider the possibility that at least some IT staff may not be available in the event of a disaster. Consider planning for augmentation of staff by using vendor or supplier resources for business-critical functions.
- Consider performing regular disaster recovery exercises. This practice can be costly, but because of the significant disruption a disaster can cause, it might be a worthwhile expenditure.
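Choosing between hot, warm, and cold standby is a trade-off between recovery time and cost. The sketch below shows one way to frame that choice: pick the cheapest option that still recovers within the time the business requires. The recovery times and annual costs are hypothetical placeholders, and the names `RecoveryOption` and `cheapest_option` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    recovery_hours: float  # assumed time to restore service
    annual_cost: float     # assumed yearly cost, arbitrary currency units


# Hypothetical figures, for illustration only.
OPTIONS = [
    RecoveryOption("hot standby", 0.0, 500_000),
    RecoveryOption("warm standby", 24.0, 150_000),
    RecoveryOption("cold standby", 168.0, 30_000),
]


def cheapest_option(options, required_recovery_hours):
    """Return the lowest-cost option that recovers within the required
    time, or None if no option is fast enough."""
    candidates = [
        o for o in options if o.recovery_hours <= required_recovery_hours
    ]
    return min(candidates, key=lambda o: o.annual_cost, default=None)


# A service that must be back within two days.
choice = cheapest_option(OPTIONS, required_recovery_hours=48)
print(choice.name if choice else "no option meets the requirement")
```

A result of `None` signals that the requirement cannot be met with the options on the table, which is the point at which the business either relaxes the target or funds a faster option.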
Develop monitoring plan
The monitoring plan defines the process by which the service will be monitored, the information required, and the ways in which the results will be reported and used. For more information about monitoring, see the Service Monitoring and Control SMF.
Key questions:
- What do we need to monitor to determine the reliability of this service? What service components need to be monitored to provide an overall framework for measuring service reliability?
- Monitoring requirements have been identified in availability, capacity, information security and service continuity plans; what is the best way for us to implement these in a consolidated and cost-effective manner?
- Which thresholds will tell us a major component failure is imminent, and what are the triggers that will indicate that a failure has occurred?
- Are we able to monitor these components with the existing infrastructure, and does this allow us to correlate events and initiate responses? Does the service have a health model and built-in instrumentation to assist with monitoring requirements?
- If a trigger event occurs, what steps should IT staff take? Do we have documented response plans and have we communicated these to key IT staff?
Inputs:
- Service health model
- Capacity and workforce plans
- Service map
- Technical design documentation
- Business transaction priorities (which transactions are important, and the service levels they need to deliver)
- Service Monitoring and Control function of Service Management
Outputs:
- Thresholds and alerts that allow insight into the end-user experience with this service
- Service health monitoring model
- RFC to implement monitoring changes
- Reporting templates
- Run book for the Service Desk, explaining how to respond to problems that are raised by monitoring activities
Best practices:
- Collaborate on operations requirements early in the service design activities. Monitoring benefits from the same early planning as other proactive activities: adding a monitoring capability that matches infrastructure standards after the fact is expensive and rarely accomplished.
- Automate monitoring and alert responses wherever possible. Automation reduces errors, improves response times, and frees staff for other tasks.
- Consider using synthetic transactions (such as a simple, system-generated test e-mail) to holistically measure how the service is behaving. While it is easy to monitor individual components, the really important measurement is that of the end-user experience with the service.
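A synthetic transaction can be as simple as a timed call that is classified against a response-time threshold. The following is a minimal sketch, assuming a hypothetical `send_test_message` function that stands in for a real end-to-end check (such as sending and receiving a test e-mail); the threshold and status names are also assumptions.

```python
import time


def run_synthetic_transaction(transaction, max_seconds):
    """Run `transaction` once and classify the outcome as the end user
    would experience it: 'ok', 'slow', or 'failed'."""
    started = time.monotonic()
    try:
        transaction()
    except Exception as exc:
        elapsed = time.monotonic() - started
        return {"status": "failed", "seconds": elapsed, "detail": repr(exc)}
    elapsed = time.monotonic() - started
    status = "ok" if elapsed <= max_seconds else "slow"
    return {"status": status, "seconds": elapsed, "detail": ""}


def send_test_message():
    """Hypothetical stand-in for a real end-to-end check."""
    time.sleep(0.01)  # simulate the time the service takes to respond


result = run_synthetic_transaction(send_test_message, max_seconds=2.0)
print(result["status"], f"{result['seconds']:.3f}s")
```

Running a probe like this on a schedule and alerting on 'slow' or 'failed' results measures the service as a whole, which complements rather than replaces component-level monitoring.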
Review and approve plans
Key questions:
- Are the plans sound and complete?
- Can they be implemented cost-effectively and practically?
- Do the outcomes of the plans meet the original objectives and expectations from the business?
- If it’s not possible to meet the objectives within the budget, will the business relax the requirements, or will it increase the budget to allow meeting the original objectives? What tradeoffs are necessary?
- Do these plans align with overall IT strategy? Does the strategy need revision, or do alternate solutions need to be evaluated?
- Can service performance be monitored effectively? Is the right information gathered to be able to assess service reliability and to identify areas for improvement?
- Is it necessary to adjust resources or add skills?
Inputs:
- Availability plan
- Capacity plan
- Information security plan
- IT service continuity plan
- Monitoring plan
Outputs:
- Approved plans
- New or updated RFC
- Updated project charter, if applicable
- Review of requirements/expectations/budget
Best practices:
- Ensure that these plans, infrastructure changes and updates, technology implementations, and monitoring recommendations work well together. Identify and drive for a common set of technologies and components that can be shared as much as possible. These plans should be approved by both IT and the business.
- It is equally important that activities and actions from the reliability function are consistent with, and align to, the overall IT strategy and standards. Unless there is no alternative, avoid customized solutions that require additional resources, licenses, and skills to manage.
- Take advantage of assistance offered by vendors, suppliers, and service providers during plan development—they can help with technical advice, best practice guidance, design and architecture reviews, and implementation assistance.