The purpose of using data replication is to maintain current replicas of the data at remote sites. Exchange servers can use the replicas at the remote site to provide continuity of e-mail service in the event of a storage or site outage in the primary location. Data replication can be propagated in a synchronous or asynchronous fashion. By definition, when data is replicated synchronously, hosts will only get a write complete response from the storage when the I/O writes are committed in both the local and remote locations. In other words, both the local and the remote storage must implement the change before the write is acknowledged to the host as having succeeded. In asynchronous mode, the host will get a write complete from the storage when the write has committed to the local storage, without having to wait for an acknowledgement from the remote storage that it has also updated the replica.
Asynchronous Replication
In general, the host application is not as sensitive in terms of increased write latency to the replication distance when using asynchronous replication as it would be in a synchronous replication mode. However, you should be aware of the following issues when you deploy asynchronous replication:
-
Data Loss Depending on the frequency of data replication, data changes at the remote site may lag the changes made at the primary site. In the case of a primary site outage, the remote site replica will not be completely current. Although this delay is configurable on most storage solutions, you should be aware of the potential for data loss due to this behavior.
-
Data Integrity (Write Order Preservation) Exchange has write order dependencies between the database and its associated transaction logs. Exchange always writes changes to the log files first, before committing those changes to the database files. When in synchronous replication mode, the application controls the write ordering. However, in asynchronous mode, the replication solution controls when to replicate the data. If the solution does not maintain the write ordering during replication, it could potentially corrupt the database files and prevent the databases from mounting when disaster occurs at the primary site.
-
Performance Impact Many vendors claim that their asynchronous solutions do not affect storage performance, but in reality, there will be a performance degradation when running asynchronous replication. Depending on the implementation of the solution, there is no single number to describe the performance expectations. Therefore, customers should well test the solution before deployment, and the goal is to verify the solution can provide adequate storage performance for Exchange users.
Some solution providers use various technologies to address the write order preservation issue. To successfully deploy an asynchronously replicated solution, the customer must work with the vendor to ensure their asynchronous technology meets the following requirements:
-
It can maintain the write-order consistency of all devices in a storage group, including being continuously consistent with each other;
-
It has been proven to be recoverable, preferably in both a lab and a production environment;
-
It is being provided by a vendor with a support plan in place for the replicated data.
Synchronous Replication
The main concern for synchronous replication in Exchange deployment is related to performance. Tests have shown that the client experience is closely tied to write latency. With a synchronous replication solution, the number of mailboxes that can be hosted per Exchange server is reduced. The performance impact largely depends on the replication distance, replication link bandwidth, and utilization. Synchronous replication can cause as much as a 75 percent reduction in mailboxes/server scalability. You should consider this scalability reduction factor when you are working on your Exchange capacity planning methodology. For more information, see "Deployment Planning for Synchronous Replication" later in this topic.
Synchronous replication has commonly been considered a solution that ensures no data loss because the replicas are completely synchronized with the primary storage data files. However, contrary to this common belief, there are scenarios in which synchronous replication solutions can lose data. The following example illustrates such a scenario.
Generally, storage data replication solutions handle a replication link failure in one of the two following ways:
-
Continue to commit write I/O to the primary storage device only, record all the changes made to all the devices which use the replication link in a log file, and store the log in the primary storage.
-
Fail the write operation, so that the application handles the failure as if the disk has failed.
If the replication solution uses the first handling method, data loss is possible. During a link failure condition followed shortly by a primary site failure, data which has been committed after the link failure will not get replicated; therefore, it will be lost along with the primary storage failure. When you design the storage replication solution, keep in mind these types of failure conditions, so that you can build the system to reduce such occurrences. In this example, the customer might want to consider deploying redundant replication links to reduce the chance of data loss.
Synchronous Replication Solutions
Synchronous replication solutions are classified by where the replication occurs, either host-based replication or storage-based replication. Host-based replication generally uses host-based software (filter driver) to interrupt the I/O stream to manage the replication. Storage-based replication occurs off-host at the storage device level. Both replication solutions can be deployed as part of either of the following:
-
Geographically Dispersed Clustering (Geocluster) In this category, the nodes that belong to the same cluster are placed in different sites. Generally, Exchange servers are actively hosted by the nodes at the primary site. Solutions provide synchronous replication of the Exchange data to the remote site(s). In the case of a primary site disaster, the Exchange virtual servers failover to the passive nodes at the remote site and come online using the replicated Exchange data.
The Microsoft Windows® Server Catalog has a category for geographically dispersed cluster solutions. You can search the Windows Hardware Qualified Labs (WHQL)-qualified geographically dispersed cluster solutions at http://go.microsoft.com/fwlink/?LinkId=28572.
-
Others This category includes all other types of synchronous replication deployments which do not use geographically dispersed clustering. These solutions rely on some other means for making use of the replicated Exchange data at the remote site(s) in the event of a primary site failure (for example, a standby solution; replication coupled with disaster recovery processes).
Microsoft strongly recommends that customers obtain assurances from their replication solution vendors on the following issues:
-
Is the solution in the category of a geographically dispersed clustering solution? If so, is it WHQL-certified? If it is not such a solution, is the storage device listed on a solution outlined in the "Cluster Solutions, Geographically Dispersed Clustering Solution" section of the Windows Server Catalog?
-
Will the replication solution prevent all possibility of data loss short of simultaneous outage at all sites?
-
What are the procedures for performing a failover and fail back?
-
Can the replication solution and expected latency handle the planned Exchange user load and provide a quality client experience?