High Availability and Disaster Recovery (Middleware): A Technical Reference Guide for Designing Mission-Critical Middleware Solutions


High availability (HA) is a measure of a system’s ability to remain accessible when a system component fails. HA is generally implemented by building multiple levels of fault tolerance, load balancing, or both into a system. Disaster recovery (DR), on the other hand, is the process by which a system is restored to a previous acceptable state after a natural or man-made disaster. Both increase overall availability, but they differ in emphasis: HA is about retaining the service, and DR is about retaining the data. With HA there is generally no loss of service, whereas with DR there is usually some loss of service while the DR plan is executed and the system is restored.

HA and DR strategies should address the relevant non-functional requirements, such as performance, system availability, fault tolerance, data retention, business continuity, and user experience. The selection of an HA and DR strategy must be driven by business requirements. For HA, determine the service level agreements (SLAs) expected of your system. For DR, use measurable characteristics, such as Recovery Time Objective (RTO, the maximum acceptable time to restore service) and Recovery Point Objective (RPO, the maximum acceptable window of data loss), to drive your DR plan.
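To make an availability SLA concrete, it can help to convert the percentage into a downtime budget. The following is a minimal sketch, not part of the original guide; the function name `allowed_downtime` and the 365-day year with no planned-maintenance exclusions are assumptions made for illustration.

```python
from datetime import timedelta

# Assumption: a non-leap year and no exclusions for planned maintenance.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime(availability_pct: float,
                     period_hours: float = HOURS_PER_YEAR) -> timedelta:
    """Return the downtime budget implied by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return timedelta(hours=period_hours * unavailable_fraction)


if __name__ == "__main__":
    # Print the yearly downtime budget for a few common targets.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime(target)} per year")
```

Comparing this budget with the recovery time measured in failover tests is one way to check whether a proposed HA design can plausibly meet the SLA.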

Best Practices

Development

Management and Administration

Performance

Case Studies and References

Examples of successful architectures are described in the following case studies and white papers:

Questions and Considerations

This section provides questions and issues to consider when working with your customers.

Development

  • Consider HA at multiple tiers within your architecture: web, application, and data.

  • Understand the database-level object design and atomic transaction constraints when considering a database for SQL Server replication. For example, transactional replication requires a primary key on every published table. Also, understand how triggers behave in a replication scenario.

  • When using BizTalk adapters such as POP3, MQ, MSMQ, and FTP, be sure to run them under a clustered BizTalk host for proper operation and HA.
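The advice to consider HA at every tier can be illustrated with some simple arithmetic. The sketch below assumes that instances fail independently (real deployments only approximate this) and that a request needs every tier to be available; the function names and the tier figures in the usage note are hypothetical.

```python
from math import prod


def tier_availability(instance_availability: float, instances: int) -> float:
    """Availability of a tier that stays up while at least one instance is up.

    Assumes instances fail independently of one another.
    """
    if instances < 1:
        raise ValueError("a tier needs at least one instance")
    return 1 - (1 - instance_availability) ** instances


def system_availability(tiers: dict[str, tuple[float, int]]) -> float:
    """Availability of a request path that needs every tier (series composition).

    `tiers` maps a tier name to (per-instance availability, instance count).
    """
    return prod(tier_availability(a, n) for a, n in tiers.values())
```

For example, `system_availability({"web": (0.99, 2), "application": (0.99, 2), "data": (0.99, 1)})` shows how a single non-redundant tier can dominate the overall figure even when the other tiers are redundant.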

Deployment

  • For application services such as BizTalk, consider adding multiple host instances to provide fault tolerance and load balancing.

  • Understand the clustering configuration required for transaction support and to avoid message loss when sending and receiving messages between MSMQ and BizTalk.

  • For data services such as SQL Server, understand the multiple technologies available to support HA and DR such as failover clustering, replication, log shipping, and mirroring. Understand the differences and constraints within each technology.

  • When considering a replication strategy for a database, understand the difference between software replication, such as that provided by SQL Server, and hardware replication through a SAN replication solution.

  • Understand how the physical distance between databases used for replication impacts the overall cost of implementation and latency. Consider the differences between replicating a database to another server in the same physical data center versus replicating a database across the world.

  • Recognize how network bandwidth plays a role between the servers used in replication.

  • Identify the various RAID configurations available for physical disks. Analyze costs versus performance, availability, and disk reliability as part of the selection criteria for each RAID configuration.

  • Become familiar with the benefits and drawbacks of using a server’s local disks versus a Storage Area Network (SAN), Network-Attached Storage (NAS), or a hybrid SAN-NAS solution.
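The points above about physical distance and network bandwidth can be roughed out before any hardware is chosen. The following sketch is an illustration only: it assumes light travels about 200 km per millisecond in optical fiber, ignores routing, queuing, and protocol overhead, and uses hypothetical function names and a hypothetical default headroom factor.

```python
# Assumption: roughly 200 km per millisecond, about two thirds of the
# speed of light in a vacuum, which is typical for optical fiber.
FIBER_KM_PER_MS = 200.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone."""
    return 2 * distance_km / FIBER_KM_PER_MS


def required_mbps(change_mb_per_hour: float, headroom: float = 2.0) -> float:
    """Sustained bandwidth in megabits per second needed to ship the hourly
    change volume (given in megabytes), multiplied by a headroom factor."""
    megabits_per_hour = change_mb_per_hour * 8
    return megabits_per_hour / 3600 * headroom
```

The round-trip figure is a floor, not a prediction. A synchronous replication scheme that waits for the remote server to acknowledge each commit would pay at least that delay per transaction, which is one reason long-distance links usually favor asynchronous options.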

Management and Administration

  • Document and test your HA and DR strategy prior to bringing your system online.

  • If you are using failover clustering or replication for HA, trigger a failover under varying levels of system usage by simulating one or more system component failures. Measure data loss and service unavailability to determine the reliability of your cluster configuration.

  • If you are using replication or mirroring for disaster recovery, simulate a disaster scenario and measure data loss and recovery time to verify that you meet your RTO and RPO.

  • When using web session state on the server, configure sticky sessions (server affinity) in Windows Network Load Balancing (NLB) or in your hardware load balancer (HLB). Alternatively, consider storing session state in a central cache server such as AppFabric Cache. A centralized cache can provide HA for frequently accessed data and avoids the need for server affinity.

  • When using SQL Server replication to support a multi-master database scenario, take into consideration how conflict resolution will be handled. Will your system generate data conflict reports? Who will review these reports, and how often?

  • When using BizTalk, be sure to configure and schedule backups for the BizTalk databases in support of a DR scenario.

  • When using AppFabric, be sure to back up relevant settings and configuration files in support of a DR scenario.

  • Does your application or service require the Microsoft Distributed Transaction Coordinator (MSDTC)?
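The failover and disaster simulations described above produce three timestamps that can be compared directly with the RTO and RPO. The sketch below shows one way to record and evaluate a drill; the `DrillResult` class, its field names, and `meets_objectives` are hypothetical names introduced for illustration, not part of any product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    failure_at: datetime            # when the simulated failure was triggered
    last_recovered_write: datetime  # newest write still present after recovery
    service_restored_at: datetime   # when the system accepted requests again

    @property
    def recovery_time(self) -> timedelta:
        """Elapsed time from failure to restored service (compare with RTO)."""
        return self.service_restored_at - self.failure_at

    @property
    def data_loss_window(self) -> timedelta:
        """Span of writes lost before the failure (compare with RPO)."""
        return max(self.failure_at - self.last_recovered_write, timedelta(0))


def meets_objectives(result: DrillResult, rto: timedelta, rpo: timedelta) -> bool:
    """True when both the recovery time and the data loss are within target."""
    return result.recovery_time <= rto and result.data_loss_window <= rpo
```

Recording one `DrillResult` per test run, at each load level, gives a simple history showing whether the configuration keeps meeting its objectives as the system changes.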

Performance

  • Measure the availability of your system under load by planning and executing performance tests and, when one is available, use a performance paper as a benchmark for comparison.

  • To gain performance on a solution whose servers are near capacity, first review the services being delivered and whether they can be better distributed among the existing servers, and then consider scaling up before scaling out.

  • Although it is more expensive, consider scaling out to provide load balancing and fault tolerance. Scaling up, in contrast, can yield additional performance but does not provide fault tolerance.

  • Consider tuning the server OS and network devices to provide an optimized baseline environment for your applications.

  • If SSL is required for HTTP endpoints, use an SSL acceleration solution to speed up HTTP response times.

  • When using replication between geographically distributed data centers, make an informed decision among the available types of network-based solutions, because throughput optimization becomes very significant.
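When scaling out for fault tolerance, the server count has to cover peak load even with some servers down. The following is a minimal sizing sketch under the assumptions that load is spread evenly and that per-server capacity has been measured in your own performance tests; the function name and parameters are hypothetical.

```python
import math


def servers_needed(peak_rps: float, per_server_rps: float,
                   tolerated_failures: int = 1) -> int:
    """Servers required to carry peak load with `tolerated_failures` servers down.

    Assumes requests are distributed evenly across the remaining servers.
    """
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    if tolerated_failures < 0:
        raise ValueError("tolerated_failures cannot be negative")
    return math.ceil(peak_rps / per_server_rps) + tolerated_failures
```

With `tolerated_failures=0` the result is the bare capacity requirement, which makes the cost of the extra fault tolerance explicit when it is weighed against scaling up.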
