Monitor Performance for AlwaysOn Availability Groups
If one or more of the secondary replica in the availability group is has been in the Synchronizing state for a long period of time, you need to determine the issues that prevents the availability replicas from being fully synchronized. This topic shows you how to estimate whether and when the replicas can be fully synchronized and how to identify the bottleneck when troubleshooting slow synchronization between the primary and secondary replicas.
To estimate the time to full synchronization and to identify the bottleneck, you need to understand the synchronization process. Performance bottleneck can be anywhere in the process, and locating the bottleneck can help you dig deeper into the underlying issues.
Log data is flushed to disk. This log must be replicated to the secondary replicas. The log records enter the send queue.
SQL Server:Database > Log bytes flushed\sec (Performance Counter)
Logs for each database is captured and sent to the corresponding partner queue (one per database-replica pair). This capture process runs continuously as long as the availability replica is connected and data movement is not suspended for any reason, and the database-replica pair is shown to be either Synchronizing or Synchronized. If the capture process is not able to scan and enqueue the messages fast enough, the log send queue builds up.
Performance counter SQL Server:Availability Replica > Bytes Sent to Replica\sec, which is an aggregation of the sum of all database messages queued for that availability replica.
DMVs sys.dm_hadr_database_replica_states, column log_send_queue_size (KB) and log_bytes_send_rate (KB/sec) on the primary replica.
The messages in each database-replica queue is dequeued and sent across the wire to the respective secondary replica.
Performance counters SQL Server:Availability Replica > Bytes sent to transport\sec and SQL Server:Availability Replica > Message Acknowledgement Time (ms)
Receive and Cache
Each secondary replica receives and caches the message.
Performance counter SQL Server:Availabiltiy Replica > Log Bytes Received/sec
Log is flushed on the secondary replica for hardening.
Performance counter SQL Server:Database > Log Bytes Flushed/sec
Wait type HADR_LOGCAPTURE_SYNC
Redo the flushed pages. Pages are kept in the redo queue as they wait to be redone.
Performance counter SQL Server:Database Replica > Redone Bytes/sec
DMV sys.dm_hadr_database_replica_states, columns redo_queue_size (KB) and redo_rate.
Wait type REDO_SYNC
AlwysOn Availability Groups is designed with flow control gates on the primary replica to avoid excessive resource consumption on all availability replicas. After the logs have been captured on the primary replica, they are subject to two levels of flow controls, as shown in the table below.
Number of Gates
Number of messages
1 per availabiltiy replica
Extended event database_transport_flow_control_action
1 per availability database
Wait type DBMIRROR_SEND
Extended event hadron_database_flow_control_action
Wait type DBMIRROR_SEND
Performance counters SQL Server:Availability Replica > Flow control/sec and SQL Server:Availability Replica > Flow Control Time (ms)
Once the message threshold of either gate is reached, log messages are no longer sent to a specific replica or for a specific database. These can be sent once acknowledgement messages are received for the sent messages to bring the number of sent messages below the threshold.
In addition to the flow control gates, there is another factor that can prevent the log messages from being sent. The synchronization of replicas ensures that the messages are sent and applied in the order of the log sequence numbers (LSN). Before a log message is sent, its LSN also checked against the lowest acknowledged LSN number to make sure that it is less than one of thresholds (depending on the message type). If the gap between the two LSN numbers is larger than the threshold, the messages are not sent. Once the gap is below the threshold again, the messages are sent.
To estimate the time to synchronization, you need to know the following pieces of information (unless otherwise noted, the DMV values from sys.dm_hadr_database_replica_states are shown in parentheses):
Average rate of log generation on the primary replica (performance counter SQL Server:Database > Log Bytes Flushed/sec)
Average size of the log send queue on the primary replica (log_send_queue_size)
Average size of the redo queue on the secondary replica in question (redo_queue_size)
Average log send rate on the primary replica (byte_send_rate)
Average redo rate on the secondary replica in question (redo_rate)
Note that you should not simply use the current values on these performance metrics, but the average values overtime. For the primary-secondary replica pair to be synchronized, the log generation rate must not exceed the log send rate or the redo rate. If either of these numbers are higher than the log generation rate, downstream operation can never catch up to log generation, and your secondary replica is in the Synchronizing state indefinitely.
Ignoring any issues related to network latency, the time to synchronization for a given primary-secondary replica pair is the sum of the average time to empty the log send queue and the average time to empty to redo queue. In the calculation, the slower rate between the log send rate and the redo rate becomes the dominant rate you should use in the calculation. If logs are sent faster than they can be redone, the logs they still need to be redone on the secondary replica at a slower pace. So the formula looks like the following:
Time to synchronization = (Average send queue size + average redo queue size) / Slower (average send rate, average redo rate)
Note that these numbers do not include metrics for FILESTREAM traffic. When a FILESTREAM related log block is captured on the primary replica, FILESTREAM messages are sent to the secondary replica indepedent of and in parallel with the log messages. On the secondary replica, the log block is held until all FILESTREAM messages are received. Metrics related to FILESTREAM messages can also be found in the DMV sys.dm_hadr_database_replica_states, columns filestream_queue_size (KB) and filestream_bytes_send_rate. Additionally, the wait type HADR_FILESTREAM_BLOCK_FLUSH gives you wait statistics related to synchronizing FILESTREAM log blocks.
The following extended events are useful when troubleshooting replicas in the Synchronizing state.