
Database Consistency and -1018 Errors
When a page is read, ESE examines a flag on the page to see whether the page uses the current checksum format, and then calculates the appropriate checksum. If the calculated checksum for a page in the current format does not match the checksum stored on the page, ESE tries to correct the error. If the error cannot be corrected automatically, Exchange reports a -1018 error.
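The read-time flow described above can be sketched as follows. This is only an illustration under stated assumptions, not ESE's actual algorithm: it ignores the checksum-format flag, uses CRC-32 as a stand-in checksum, and uses a toy "XOR of set-bit positions" code to locate a single flipped bit. The names `position_xor`, `make_checksum`, and `read_page` are hypothetical.

```python
import zlib

READ_VERIFY_FAILURE = -1018  # the error code discussed in the text


def position_xor(data: bytes) -> int:
    """XOR of the 1-based positions of all set bits (toy locator code)."""
    acc = 0
    for index, byte in enumerate(data):
        for bit in range(8):
            if (byte >> bit) & 1:
                acc ^= index * 8 + bit + 1
    return acc


def make_checksum(data: bytes) -> tuple[int, int]:
    """Stand-in checksum: CRC-32 for detection plus the locator code."""
    return zlib.crc32(data), position_xor(data)


def read_page(data: bytes, stored: tuple[int, int]) -> tuple[int, bytes]:
    """Return (status, data). Status is 0, or -1018 if the page cannot be fixed."""
    stored_crc, stored_locator = stored
    if zlib.crc32(data) == stored_crc:
        return 0, data  # checksum matches, nothing to do

    # Mismatch: if exactly one bit flipped, the two locator values differ
    # by that bit's 1-based position.
    suspect = stored_locator ^ position_xor(data)
    if 1 <= suspect <= len(data) * 8:
        index, bit = divmod(suspect - 1, 8)
        repaired = bytearray(data)
        repaired[index] ^= 1 << bit
        # Accept the repair only if the page now verifies completely.
        if make_checksum(bytes(repaired)) == stored:
            return 0, bytes(repaired)

    return READ_VERIFY_FAILURE, data


page = bytes(range(256)) * 16            # 4096 bytes of sample data
stored = make_checksum(page)

damaged = bytearray(page)
damaged[100] ^= 0x10                     # one flipped bit: correctable
status_one, fixed = read_page(bytes(damaged), stored)

damaged[200] ^= 0x01                     # a second flipped bit: not correctable
status_two, _ = read_page(bytes(damaged), stored)
```

In this sketch a single flipped bit is repaired silently, while damage that the locator code cannot undo is reported as -1018, which mirrors the "correct if possible, otherwise report" behavior described above.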
The Exchange store itself could be responsible for a -1018 error if it did one of the following:
- Constructs a page that has the wrong checksum.
- Constructs a page correctly, but tells the operating system to write the page in the wrong location.
If a system administrator encounters a -1018 error and runs diagnostic hardware tests against the server, and these tests report no issues, the administrator might conclude that Exchange must be responsible for the issue, because the hardware passed the initial analysis.
Frequently, however, additional investigation by Microsoft or hardware vendors uncovers subtle issues in hardware, firmware, or device drivers that are actually responsible for damaging the database file.
Ordinary diagnostic tests might not detect all the transient faults for several reasons. Issues in firmware or driver software might fall outside the capabilities of diagnostic programs. Diagnostic tests might be unable to adequately simulate long run times or complex loads. Also, the addition of diagnostic monitoring or debug logging might change the system enough to prevent the issue from appearing again.
The simplicity and stability of the Exchange mechanisms that generate checksums and write pages to the database file suggest that a -1018 error is probably caused by something other than Exchange. The checksum and incorrect page detection mechanisms are simple and reliable, and have remained fundamentally the same since the first Exchange release, except for minor changes to adapt to database page format changes between database versions.
A checksum is generated for a page that is about to be written to disk, after all other data is written to the page, including the page number itself. After Exchange adds the checksum to the page, Exchange instructs the Microsoft Windows Server operating system to write the page to disk by using standard, published Windows Server APIs.
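The write-side ordering described above (fill in the page, including its page number, compute the checksum last, then hand the page to the operating system) can be sketched as follows. The page layout, the 4096-byte page size, the use of CRC-32, and the names `build_page` and `write_page` are assumptions made for illustration; they are not the real ESE page format or API.

```python
import io
import struct
import zlib

PAGE_SIZE = 4096                   # assumed page size for this sketch
HEADER = struct.Struct("<II")      # hypothetical header: checksum, page number


def build_page(page_number: int, payload: bytes) -> bytes:
    """Build one page; the checksum is computed after everything else is set."""
    if len(payload) > PAGE_SIZE - HEADER.size:
        raise ValueError("payload does not fit in one page")
    # 1. Write all other data first, including the page number itself.
    body = struct.pack("<I", page_number) + payload.ljust(
        PAGE_SIZE - HEADER.size, b"\0"
    )
    # 2. Compute the checksum over the finished contents.
    checksum = zlib.crc32(body)
    # 3. Add the checksum to the page.
    return struct.pack("<I", checksum) + body


def write_page(db, page_number: int, page: bytes) -> None:
    """Ask the OS (here, any seekable file object) to write the page."""
    db.seek(page_number * PAGE_SIZE)   # location derived from the page number
    db.write(page)


db = io.BytesIO()                      # in-memory stand-in for the database file
write_page(db, 3, build_page(3, b"hello"))
```

Because the page number is covered by the checksum and also determines the file offset, a reader can later test both the contents and the location of the page independently.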
The checksum might be generated correctly for a page, but the page might be written to the wrong location on the hard disk. This can be caused by a transient memory error, such as a "bit flip." For example, suppose Exchange constructs a new version of page 70. The page itself does not experience an error, but the copy of the page number that is used by the disk controller or by the operating system is randomly changed. This problem can occur if an unstable memory cell flips a single bit, changing 70 (binary 1000110) to 6 (binary 0000110). The page's checksum is still correct, but the location of the page in the database is now wrong. Exchange reports a -1018 error for the page when it detects that the logical page number does not match the physical location of the page.
Another kind of page numbering error, this one caused by Exchange, could occur if Exchange wrote the wrong page number on the page itself. However, this would cause other errors, not the -1018 error. If Exchange writes 71 on page 70, and then calculates the checksum on the page correctly, the page is written to location 71 and passes both the page number and checksum tests.
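The two page-numbering cases above can be compared in a small sketch. The page layout (a 4-byte checksum followed by a 4-byte page number), the use of CRC-32, and the names `make_page` and `check_page` are hypothetical and chosen only to illustrate the two tests.

```python
import struct
import zlib


def make_page(logical_number: int, payload: bytes) -> bytes:
    """Hypothetical layout: 4-byte checksum, 4-byte page number, payload."""
    body = struct.pack("<I", logical_number) + payload
    return struct.pack("<I", zlib.crc32(body)) + body


def check_page(physical_slot: int, page: bytes) -> list[str]:
    """Return the names of the tests that fail for a page read from a slot."""
    stored_checksum, logical_number = struct.unpack_from("<II", page)
    failures = []
    if zlib.crc32(page[4:]) != stored_checksum:
        failures.append("checksum")
    if logical_number != physical_slot:
        failures.append("page number")
    return failures


# Case 1: page 70 is built correctly, but one flipped bit in the write
# address sends it to slot 6 (0b1000110 -> 0b0000110).
flipped_slot = 70 ^ (1 << 6)
case1 = check_page(flipped_slot, make_page(70, b"data"))

# Case 2: the software stamps 71 on what should be page 70, checksums it
# correctly, and writes it to slot 71.
case2 = check_page(71, make_page(71, b"data"))

print(flipped_slot, case1, case2)   # prints: 6 ['page number'] []
```

In the first case the checksum test passes and only the location test fails, which is the condition reported as -1018. In the second case both tests pass, so the mistake would have to surface as some other kind of error.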
Frequently, a single -1018 error that is reported in an Exchange database does not cause the database to stop or result in any symptom other than the presence of the -1018 error itself. The damaged page might be in a folder that is infrequently accessed (for example, the Sent Items or Deleted Items folder), in an attachment that is seldom opened, or might even be empty.
Even though a single -1018 error is unlikely to cause extensive data loss, -1018 errors are still cause for concern, because a -1018 error is proof that your storage system failed to reliably store or retrieve data at least once. Although the -1018 error might be a transient issue that never occurs again, it is more likely that this error is an early warning of an issue that will become progressively worse. Even if the first -1018 error is on an empty page in the database, you cannot know which page might be damaged next. If a critical global table is damaged, the database might not start, and database repair might be partly or completely unsuccessful.
After a -1018 error is logged, you must consider and plan for the possibility of imminent failure or additional random damage to the database, until you find and eliminate the root cause.