Automatic Page Repair (Availability Groups: Database Mirroring)
Published: May 17, 2016
Updated: May 17, 2016
Applies To: SQL Server 2016
Automatic page repair is supported by database mirroring and by Always On Availability Groups. After certain types of errors corrupt a page, making it unreadable, a database mirroring partner (principal or mirror) or an availability replica (primary or secondary) attempts to automatically recover the page. The partner/replica that cannot read the page requests a fresh copy of the page from its partner or from another replica. If this request succeeds, the unreadable page is replaced by the readable copy, and this usually resolves the error.
Generally speaking, database mirroring and Always On Availability Groups handle I/O errors in equivalent ways. The few differences are explicitly called out here.
Database mirroring automatic page repair tries to repair only pages in a data file on which an operation has failed for one of the errors listed in the following table.
|Error number||Description||Instances that cause automatic page-repair attempt|
|823||Action is taken only if the operating system performed a cyclic redundancy check (CRC) that failed on the data.||ERROR_CRC. The operating-system value for this error is 23.|
|824||Logical errors.||Logical data errors, such as torn write or bad page checksum.|
|829||A page has been marked as restore pending.||All.|
Automatic page repair cannot repair the following control page types:
File header page (page ID 0).
Page 9 (the database boot page).
Allocation pages: Global Allocation Map (GAM) pages, Shared Global Allocation Map (SGAM) pages, and Page Free Space (PFS) pages.
On the principal/primary database, automatic page repair is tried only when the database is in the SYNCHRONIZED state and the principal/primary is still sending log records for the database to the mirror/secondary. The basic sequence of actions in an automatic page-repair attempt are as follows:
When a read error occurs on a data page in the principal/primary database, the principal/primary inserts a row in the suspect_pages table with the appropriate error status. For database mirroring, the principal then requests a copy of the page from the mirror. For Always On Availability Groups, the primary broadcasts the request to all the secondaries and gets the page from the first to respond. The request specifies the page ID and the LSN that is currently at the end of the flushed log. The page is marked as restore pending. This makes it inaccessible during the automatic page-repair attempt. Attempts to access this page during the repair attempt will fail with error 829 (restore pending).
After receiving the page request, the mirror/secondary waits until it has redone the log up to the LSN specified in the request. Then, the mirror/secondary tries to access the page in its copy of the database. If the page can be accessed, the mirror/secondary sends the copy of the page to the principal/primary. Otherwise, the mirror/secondary returns an error to the principal/primary, and the automatic page-repair attempt fails.
The principal/primary processes the response that contains the fresh copy of the page.
After the automatic page-repair attempt fixes a suspect page, the page is marked in the suspect_pages table as restored (event_type = 5).
If the page I/O error caused any deferred transactions, after you repair the page, the principal/primary tries to resolve those transactions.
I/O errors on data pages that occur on the mirror/secondary database are handled in generally the same way by database mirroring and by Always On Availability Groups.
With database mirroring, if the mirror encounters one or more page I/O errors when it redoes a log record, the mirroring session enters the SUSPENDED state. With Always On Availability Groups, if a secondary replica encounters one or more page I/O errors when it redoes a log record, the secondary database enters the SUSPENDED state. At that point, the mirror/secondary inserts a row in the suspect_pages table with the appropriate error status. The mirror/secondary then requests a copy of the page from the principal/primary.
The principal/primary tries to access the page in its copy of the database. If the page can be accessed, the principal/primary sends the copy of page to the mirror/secondary.
If the mirror/secondary receives copies of every page it has requested, the mirror/secondary tries to resume the mirroring session. If an automatic page-repair attempt fixes a suspect page, the page is marked in the suspect_pages table as restored (event_type = 4).
If a mirror/secondary does not receive a page that it requested from the principal/primary, the automatic page-repair attempt fails. With database mirroring, the mirroring session remains suspended. With Always On Availability Groups, the secondary database remains suspended. If the mirroring session or secondary database is resumed manually, the corrupted pages will be hit again during the synchronization phase.
An automatic page repair is an asynchronous process that runs in the background. Therefore, a database operation that requests an unreadable page fails and returns the error code for whatever condition caused the failure. When developing an application for a mirrored database or an availability database, you should intercept exceptions for failed operations. If the SQL Server error code is 823, 824, or 829, you should retry the operation later.
The following dynamic management views return rows for the latest automatic page-repair attempts on a given availability database or mirrored database, with a maximum of 100 rows per database.
Always On Availability Groups:
Returns a row for every automatic page-repair attempt on any availability database on an availability replica that is hosted for any availability group by the server instance.
Returns a row for every automatic page-repair attempt on any mirrored database on the server instance.