High Availability and Disaster Recovery (Middleware)---a Technical Reference Guide for Designing Mission-Critical Middleware Solutions

Article
03/05/2012


High availability (HA) is a measure of a system’s ability to remain accessible when a system component fails. Generally, HA is implemented by building multiple levels of fault tolerance, load balancing, or both into a system. Disaster recovery (DR), by contrast, is the process by which a system is restored to a previous acceptable state after a natural or man-made disaster. Both increase overall availability, but they differ in an important way: HA is about retaining the service, and DR is about retaining the data. With HA there is generally no loss of service, whereas with DR there is usually a slight loss of service while the DR plan is executed and the system is restored.

HA and DR strategies should address the relevant non-functional requirements, such as performance, system availability, fault tolerance, data retention, business continuity, and user experience. The selection of an HA and DR strategy must be driven by business requirements. For HA, determine the service level agreements (SLAs) expected of your system. For DR, use measurable characteristics, such as Recovery Time Objective (RTO, the maximum acceptable time to restore service) and Recovery Point Objective (RPO, the maximum acceptable window of data loss), to drive your DR plan.
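To make an availability target concrete, it can help to translate the percentage into the downtime it allows. The following is a minimal Python sketch and not part of any Microsoft tooling; the function name `downtime_budget_minutes` and the 365-day year are assumptions made for illustration.

```python
# Sketch: convert an availability target into a downtime budget.
# Assumption: a 365-day year (leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime, in minutes, that the target allows
    over the given period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Print the yearly budget for a few commonly quoted targets.
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows {budget:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.9 percent target allows roughly 525.6 minutes (a little under nine hours) of downtime per year, which gives the business a tangible number to accept or reject.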

Best Practices

Development

When using a hardware load balancer (HLB), understand the load balancing methods available to distribute the load. These methods include round robin, least number of connections, and response time, among others. See HLB Frequently Asked Questions for further details.
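The three methods named above differ only in how the next server is chosen. The following Python sketch illustrates the selection logic in its simplest form; real load balancers implement these policies internally, and the `Server` class, its fields, and the function names here are hypothetical.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Server:
    """Hypothetical view of a back-end server as seen by a load balancer."""
    name: str
    active_connections: int = 0
    avg_response_ms: float = 0.0


def round_robin(servers):
    """Return an iterator that hands out servers in a fixed, repeating order."""
    return cycle(servers)


def least_connections(servers):
    """Pick the server that currently holds the fewest open connections."""
    return min(servers, key=lambda s: s.active_connections)


def fastest_response(servers):
    """Pick the server with the lowest measured average response time."""
    return min(servers, key=lambda s: s.avg_response_ms)


if __name__ == "__main__":
    pool = [Server("web1", 5, 20.0), Server("web2", 2, 35.0), Server("web3", 9, 12.0)]
    rotation = round_robin(pool)
    print([next(rotation).name for _ in range(4)])
    print(least_connections(pool).name)
    print(fastest_response(pool).name)
```

Round robin ignores server state, so it suits pools of identical servers with uniform requests; the other two methods react to load but depend on the balancer having current connection counts or response measurements.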
HA in BizTalk can be achieved by configuring load balancing with multiple host instances and clustering hosts to provide fault tolerance. See Sample BizTalk Server High Availability Scenarios, Providing High Availability in the BizTalk Server Operations Guide, Checklist: Providing High Availability with Fault Tolerance or Load Balancing in the BizTalk Server Operations Guide, and Improving Fault Tolerance in BizTalk Server by Using a Windows Server Cluster.
For DR in BizTalk environments, see:
- Disaster Recovery in the BizTalk Server Operations Guide
- Checklist: Increasing Availability with Disaster Recovery in the BizTalk Server Operations Guide
See How to Cluster Message Queuing to configure MSMQ for HA.
For a detailed review of the multiple technologies in SQL Server that provide HA, see High Availability with SQL Server 2008.
When clustering SQL Server, be sure to account for the hardware and software components required for clustering to work. See SQL Server 2008 Failover Clustering.
If you require zero data loss and HA, use a combination of features such as backup, log shipping, and database mirroring in SQL Server. See Failure Is Not an Option: Zero Data Loss and High Availability for guidance.
See techniques in the BizTalk Server Database Optimization white paper for optimizing your BizTalk database infrastructure.
If you are using BizTalk RFID, see BizTalk RFID: Clustering BizTalk RFID for High Availability and BizTalk RFID: Capacity Planning and Performance Tuning.
When using SQL Server, identify the optimizations that can be made in the disk I/O subsystem to maximize availability.
Consider a SAN when you have high data throughput requirements and cost is not a factor, and a NAS when you have low to medium data throughput requirements and cost is a factor. Identify the underlying implementation differences between SAN and NAS. See The Top 10 SANs vs. NAS Decision Factors for more details.
When using AppFabric Cache, use database mirroring to handle disaster recovery. You must use a witness in your mirroring configuration to minimize the service outage of your cache during a DR scenario. See Disaster Recovery for AppFabric Caching V1.0 (SQL Mirroring).

Management and Administration

See Using Replication for High Availability and Disaster Recovery: A SQL Server 2008 Technical Case Study and Best Practices for a discussion on replication testing strategies and a known bug in SQL Server 2008 when handling conflict resolution.
For AppFabric, artifacts should be backed up for SQL Server, Windows Configuration, and IIS. See Disaster Recovery Considerations for the AppFabric Environment.
For BizTalk DR configuration, including backing up and restoring the entire BizTalk Server environment and the SSO master secret key, see Backing Up and Restoring BizTalk Server.
If you are using DTC, be sure to configure and test DTC between all clients and servers. See Troubleshooting Problems with MSDTC.

Performance

To maximize the performance and scalability of SQL Server 2008 Analysis Services, see Scale-Out Querying for Analysis Services with Read-Only Databases.
If you are considering consolidation via hardware virtualization for both OLTP and OLAP databases, see High Performance SQL Server Workloads on Hyper-V.
See the BizTalk Server 2009 Performance Optimization Guide and BizTalk Server 2006 R2 Performance Optimization Guide for complete guides on optimizing BizTalk for performance.
See the BizTalk Server 2006: Managing a Successful Performance Lab white paper on how to execute performance tests on your BizTalk solution.
If you are using SQL Server replication in a geographically distributed scenario, consider network solutions such as a WAN accelerator to increase network throughput. See Using Replication for High Availability and Disaster Recovery: A SQL Server 2008 Technical Case Study and Best Practices for a case study.

Case Studies and References

Examples of successful architectures are described in the following case studies and white papers:

Questions and Considerations

This section provides questions and issues to consider when working with your customers.

Development

Consider HA at multiple tiers within your architecture: web, application, and data.
Understand the database-level object design and atomic transaction constraints when considering a database for SQL Server replication. For example, tables published for transactional replication require primary keys. Also, understand how triggers behave in this replication scenario.
When using BizTalk adapters such as POP3, MQ, MSMQ, and FTP, be sure to run them under a clustered BizTalk host for proper operation and HA.

Deployment

For application services such as BizTalk, consider adding multiple host instances to provide fault tolerance and load balancing.
Understand the clustering configuration required for transaction support and to avoid message loss when sending and receiving messages between MSMQ and BizTalk.
For data services such as SQL Server, understand the multiple technologies available to support HA and DR such as failover clustering, replication, log shipping, and mirroring. Understand the differences and constraints within each technology.
When considering a replication strategy for a database, understand the difference between a software replication technology such as the one provided by SQL Server versus hardware replication via a SAN replication solution.
Understand how the physical distance between databases used for replication impacts the overall cost of implementation and latency. Consider the differences between replicating a database to another server in the same physical data center versus replicating a database across the world.
Recognize how network bandwidth plays a role between the servers used in replication.
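Distance and bandwidth both place lower bounds on replication delay, and a rough estimate can be made before any hardware is purchased. The Python sketch below assumes that signals travel through optical fiber at about 200,000 km per second (roughly two thirds of the speed of light in a vacuum) and ignores routing, queuing, and protocol overhead; the function names are hypothetical.

```python
# Assumption: signal speed in optical fiber, about two thirds of c.
FIBER_KM_PER_SECOND = 200_000


def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay in milliseconds.

    Real links add delay for routing, queuing, and indirect cable paths.
    """
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


def transfer_seconds(payload_megabytes: float, bandwidth_mbps: float) -> float:
    """Best-case time to move a payload over a link of the given bandwidth.

    payload_megabytes is in megabytes; bandwidth_mbps is in megabits per second.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    return payload_megabytes * 8 / bandwidth_mbps


if __name__ == "__main__":
    print(f"Round trip over 1,000 km: {round_trip_ms(1000):.1f} ms")
    print(f"500 MB over a 100 Mbps link: {transfer_seconds(500, 100):.1f} s")
```

The propagation figure matters most for synchronous techniques, where each commit waits for a round trip, while the transfer figure matters most for bulk operations such as initial synchronization or log shipping.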
Identify the various RAID configurations available for physical disks. Analyze costs versus performance, availability, and disk reliability as part of the selection criteria for each RAID configuration.
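As a starting point for that comparison, the Python sketch below computes usable capacity and the number of disk failures each common RAID level is guaranteed to survive. It covers only RAID 0, 1, 5, 6, and 10 with equally sized disks, treats RAID 1 as a single mirrored set, and treats RAID 10 as striped two-way mirrors; the function name `raid_summary` is hypothetical, and performance characteristics are not modeled.

```python
def raid_summary(level: str, disks: int, disk_tb: float):
    """Return (usable_tb, guaranteed_failures_tolerated) for a RAID set.

    Assumes equally sized disks. "Guaranteed" means the worst case: RAID 10
    can survive more failures if they land in different mirror pairs.
    """
    if level == "0":    # striping only, no redundancy
        minimum, usable, tolerated = 2, disks * disk_tb, 0
    elif level == "1":  # every disk holds a full copy
        minimum, usable, tolerated = 2, disk_tb, disks - 1
    elif level == "5":  # one disk's worth of parity
        minimum, usable, tolerated = 3, (disks - 1) * disk_tb, 1
    elif level == "6":  # two disks' worth of parity
        minimum, usable, tolerated = 4, (disks - 2) * disk_tb, 2
    elif level == "10":  # striped two-way mirrors
        if disks % 2:
            raise ValueError("RAID 10 needs an even number of disks")
        minimum, usable, tolerated = 4, disks / 2 * disk_tb, 1
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum:
        raise ValueError(f"RAID {level} needs at least {minimum} disks")
    return usable, tolerated


if __name__ == "__main__":
    for level in ("0", "1", "5", "6", "10"):
        usable, tolerated = raid_summary(level, 4, 2.0)
        print(f"RAID {level}: {usable} TB usable, survives {tolerated} disk failure(s)")
```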
Become familiar with the benefits and drawbacks of using a server’s local disks versus a Storage Area Network (SAN), a Network-Attached Storage (NAS) or a hybrid SAN-NAS solution.

Management and Administration

Document and test your HA and DR strategy prior to bringing your system online.
If you are using failover clustering or replication for HA, simulate a failover scenario under varying levels of system usage by inducing one or more system component failures. Measure data loss and service unavailability to determine the reliability of your cluster configuration.
If you are using replication or mirroring for disaster recovery, simulate a disaster scenario and measure data loss and recovery time to verify that you meet your RTO and RPO.
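The measurements from such a drill can be checked against the objectives mechanically. The Python sketch below assumes that three timestamps were recorded during the drill: the last point at which data reached the standby, the moment of the simulated failure, and the moment service was restored. The function name `evaluate_drill` and the result keys are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_drill(last_replicated_at: datetime,
                   failure_at: datetime,
                   service_restored_at: datetime,
                   rpo: timedelta,
                   rto: timedelta) -> dict:
    """Compare the measured data loss window and recovery time with RPO and RTO."""
    # Data written after the last replication point is lost in the disaster.
    data_loss_window = failure_at - last_replicated_at
    # Recovery time runs from the failure until service is available again.
    recovery_time = service_restored_at - failure_at
    return {
        "data_loss_window": data_loss_window,
        "recovery_time": recovery_time,
        "meets_rpo": data_loss_window <= rpo,
        "meets_rto": recovery_time <= rto,
    }


if __name__ == "__main__":
    result = evaluate_drill(
        last_replicated_at=datetime(2012, 3, 5, 10, 0),
        failure_at=datetime(2012, 3, 5, 10, 4),
        service_restored_at=datetime(2012, 3, 5, 10, 50),
        rpo=timedelta(minutes=5),
        rto=timedelta(minutes=30),
    )
    for key, value in result.items():
        print(f"{key}: {value}")
```

Repeating the drill under different load levels and recording each result makes it easier to see whether the objectives are met consistently or only under light usage.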
When using web session state on the server, configure sticky sessions or server affinity within Windows NLB or HLB. Alternatively, consider storing session state in a central cache server such as AppFabric Cache. A centralized cache can provide HA for frequently accessed data and avoid the need for server affinity.
When using SQL Server replication to support a multi-master database scenario, take into consideration how conflict resolution will be handled. Will your system generate data conflict reports? Who will review these reports and how often?
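To make the conflict resolution questions concrete, the Python sketch below merges rows from two sites with a last-writer-wins policy and collects every overridden row into a report for later review. Last-writer-wins is only one possible policy, chosen here for brevity; it is not a description of how SQL Server replication resolves conflicts, and the `RowVersion` class and function name are hypothetical. The sketch also assumes each key appears at most once per site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowVersion:
    """Hypothetical snapshot of one row as last written at one site."""
    key: str
    value: str
    modified_at: float  # seconds since the epoch
    site: str


def resolve_last_writer_wins(site_a_rows, site_b_rows):
    """Merge rows from two sites.

    Returns (merged, conflicts): merged maps key to the winning RowVersion,
    and conflicts lists (key, winner, loser) for every overridden row.
    On a timestamp tie, the row seen first (site A) is kept.
    """
    merged = {}
    conflicts = []
    for row in list(site_a_rows) + list(site_b_rows):
        current = merged.get(row.key)
        if current is None:
            merged[row.key] = row
        elif current.value != row.value:
            winner = row if row.modified_at > current.modified_at else current
            loser = current if winner is row else row
            conflicts.append((row.key, winner, loser))
            merged[row.key] = winner
    return merged, conflicts


if __name__ == "__main__":
    site_a = [RowVersion("order-1", "shipped", 10.0, "A"),
              RowVersion("order-2", "open", 5.0, "A")]
    site_b = [RowVersion("order-1", "cancelled", 20.0, "B")]
    merged, conflicts = resolve_last_writer_wins(site_a, site_b)
    for key, winner, loser in conflicts:
        print(f"{key}: kept '{winner.value}' from site {winner.site}, "
              f"discarded '{loser.value}' from site {loser.site}")
```

Whatever policy is chosen, keeping the discarded version in a report gives reviewers the information they need to correct cases where the automatic choice was wrong for the business.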
When using BizTalk, be sure to configure and schedule backups for the BizTalk databases in support of a DR scenario.
When using AppFabric, be sure to back up relevant settings and configuration files in support of a DR scenario.
Does your application or service require the Microsoft Distributed Transaction Coordinator (MSDTC)?

Performance

Measure the availability of your system by planning and executing performance tests on your system and, when available, use a performance paper as a benchmark comparison.
To gain performance on a solution where server resources are near capacity, first consider whether the services being delivered can be better distributed among the other servers, and then consider scaling up before scaling out.
While more expensive, consider scaling out to provide load balancing and fault tolerance. In contrast, scaling up can yield additional output in performance, but does not provide fault tolerance.
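The fault tolerance benefit of scaling out can be estimated with a simple probability model. The Python sketch below assumes that nodes fail independently and that any single surviving node can keep the service available; both assumptions are optimistic for real deployments, where shared components and load limits apply, and the function name is hypothetical.

```python
def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of a pool that is up while at least one node is up.

    Assumes independent node failures: the pool is down only when every
    node is down at the same time.
    """
    if not 0 <= node_availability <= 1:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return 1 - (1 - node_availability) ** nodes


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(f"{count} node(s) at 99%: {combined_availability(0.99, count):.6f}")
```

Under these assumptions, two nodes that are each available 99 percent of the time yield a pool availability of about 99.99 percent, whereas scaling up a single node leaves its availability unchanged.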
Consider tuning the server OS and network devices to provide an optimized baseline environment for your applications.
If SSL is required for HTTP endpoints, use an SSL acceleration solution to speed up HTTP response times.
When using replication between geographically distributed data centers, make an educated decision among the available network-based solution types, because throughput optimization becomes very significant.
