Exchange Server Error -1018: How Microsoft IT Recovers Damaged Exchange Databases
Technical White Paper
Published: August 1, 2005
Situation
Error -1018 signals that an Exchange database file has been damaged by a hardware
or file system problem. Exchange reports this error to provide early warning of
possible server failure and data loss.
Solution
This paper shows you how Microsoft IT responds to this error and recovers affected
Exchange data. It also covers the methods and tools used to find root causes and
resolve the underlying problems responsible for the error.
Benefits
- Improve your monitoring of Exchange data integrity.
- Increase your ability to determine seriousness and urgency of -1018 errors.
- Learn specific recovery strategies and how to decide when to implement them.
- Improve your operational effectiveness in handling hardware and data integrity
problems.
Products & Technologies
- Microsoft Exchange Server 2003
- Microsoft Windows Server 2003
- Exchange Jetstress and LoadSim I/O and capacity modeling tools
- Microsoft Office Excel 2003
- Medusa Labs Test Tool Suite by Finisar
- Exchange Eseutil and Isinteg repair and integrity verification tools
Executive Summary
Error -1018 (JET_errReadVerifyFailure) is a familiar—and dreaded—error in Microsoft®
Exchange Server. It indicates that an Exchange database file has been damaged by
a failure or problem in the underlying file system or hardware.
This paper explains the conditions that result in error -1018. It also covers the
detection mechanisms that Exchange uses to discover and recover from damage to its
database files.
The Microsoft Information Technology group (Microsoft IT) runs one of the most extensive
Exchange Server organizations in the world. Exchange administrators at Microsoft
have investigated and recovered from dozens of -1018 error problems. This paper
shows you how Microsoft IT monitors for this error, what happens after database
file damage has been discovered, and how Microsoft recovers databases affected by
the problem.
Note: For security reasons, the sample names of forests, domains, internal resources,
organizations, and internally developed security file names used in this paper do
not represent real resource names used within Microsoft and are for illustration
purposes only.
Readers of this paper are assumed to be familiar with the basics of Exchange administration
and database architecture. This paper describes Microsoft IT's experience and recommendations
for dealing effectively with error -1018. It is not intended to serve as a procedural
guide. Each enterprise environment has unique circumstances; therefore, each organization
should adapt the material to its specific needs.
While the focus here is on Exchange Server 2003, nearly all the material covered
applies to any version of Exchange. Exchange Server 2003 SP1 implements important new
functionality for recovering from -1018 errors. This is discussed in "ECC Page Correction
in Exchange Server 2003 SP1" later in this document.
Introduction
No computer data storage mechanism is perfect. Disks and tapes go bad. Glitches
in hardware or bugs in firmware can cause data to be corrupted. The most basic strategy
for dealing with this reality is redundancy: disks are mirrored or replicated; data
is backed up to remote locations so that when—not if—primary storage is compromised,
data can be recovered from another copy.
Loss of data is not the only risk when data becomes corrupted. If corruption is
undetected, bad decisions may be made based on the data. Stories are occasionally
reported in the press about a decimal point that is removed by random corruption
of a database record, and someone becomes a temporary millionaire as a result. Corruption
of a database can cause even more subtle or difficult errors. In Exchange, acting
on a piece of corrupted metadata could cause mail destined for one user to be sent
to another, or could cause all mail in a database to be lost.
Exchange databases therefore implement functionality to detect such damage. Even
more important than detecting random corruption is not acting on it. After Exchange
detects damage to its databases, the damaged area is treated as if it were completely
unreadable. Thus, the database cannot be further harmed by relying on the data.
The error code -1018 is reported when Exchange detects random corruption of its
data by a problem in the underlying platform. Although data corruption is a serious
problem, it is rare for a -1018 error detected during database run time to cause
the database to stop or to seriously malfunction. This is because the majority of
pages in an Exchange database have user message data written on them. The loss of
a single random page in the database is most likely to result in lost messages.
One user or group of users may be affected, but there is no impact to the overall
structural and logical integrity of the database. After a -1018 problem has been
detected, Exchange will keep running as long as the lost data is not critical to
the integrity of the database as a whole.
A -1018 error may be reported repeatedly for the same location in the database.
This can happen if a user tries repeatedly to access a particular damaged message.
Each time the access will fail, and each time a new error will be logged.
Because the immediate loss of data associated with error -1018 may be minimal, you
may be tempted to ignore the error. That would be a dangerous mistake. A -1018 error
must be investigated thoroughly and promptly. Error -1018 indicates the possibility
of other imminent failures in the platform.
Understanding Error -1018
Error code -1018 (JET_errReadVerifyFailure) means one of two conditions has been
detected when reading a page in the database:
- The logical page number recorded on the page does not correspond to the physical
location of the page inside the database file.
- The checksum recorded on the page does not match the checksum Exchange expects to
find on the page.
Statistically, a -1018 error is much more likely to be related to a wrong checksum
than to a wrong page number.
To understand why these conditions indicate file-level damage to the database, you
need to know a little more about how Exchange database files are organized.
Page Ordering
Each Exchange Server 2003 database consists of two matched files: the .edb file
and the .stm file. These files must be copied, moved, or backed up together and
must remain synchronized with each other.
Inside the database files, data is organized in sequential 4-kilobyte (KB) (4,096
byte) pages. Several pages can be collected together to form logical structures
called balanced trees (B+-Trees). Several of these trees are linked together to
form database tables. There may be thousands of tables in a database, depending
on how many mailboxes or folders it hosts.
Each page is owned by a single B+-Tree, and each B+-Tree is owned by a single table.
Error -1018 reports damage at the level of individual pages. Because database tables
are made up of pages, the error also implies problems at the higher logical levels
of the database.
At the beginning of each database file are two header pages. The header pages record
important information about the database. You can view the information on the header
pages with the Exchange Server Database Utilities tool Eseutil.
After the header pages, every other page in a database file is either a data page
or an empty page waiting for data. Each data page is numbered, in sequential order,
starting at 1. Because of the two header pages at the beginning of the file, the
third physical page is the first logical data page in the database. (You can consider
the two header pages to be logical pages -1 and 0.)
Note: Each database file as a whole has a header, and each page in a database also
has its own header. It can be confusing to distinguish between the two.
The database header is at the beginning of the database file and it records information
about the database as a whole. A page header is the first 40 bytes of each and every
page, and it records important information only about that particular page. Just
as Eseutil can display database header information, it can also display page header
fields.
In an Exchange database, you can easily calculate which logical page you are on
for any physical byte offset into the database file. Logical page -1, which is the
first copy of the database header, starts at offset 0. Logical page 0, a second
copy of the database header, starts at offset 4,096. Logical page 1, the first data
page in the database, starts at offset 8,192. Logical page 2 starts at offset 12,288,
and so on.
Each -1018 error is for a single page in the database, and it can be useful in advanced
troubleshooting to be able to locate the exact page where the error occurred.
As general formulas:
- (Logical page number + 1) × 4,096 = byte offset
- (byte offset ÷ 4,096) - 1 = logical page number
These examples may be useful:
Suppose you need to know the exact byte offset for logical page 101 in a database.
Using the first formula, (101 + 1) × 4,096 = 417,792, logical page 101 starts exactly
417,792 bytes into the file.
Now, suppose you need to know what page is at byte offset 4,104,192. Using the second
formula, (4,104,192 ÷ 4,096) - 1 = 1,001, logical page 1,001 starts at 4,104,192
bytes into the file.
In most cases, a Windows Application Log event reporting error -1018 will list the
location of the bad page as a byte offset. Therefore, the second formula is likely
to be the most frequently used. In any case, the two formulas allow you to translate
back and forth between logical pages and byte offsets as needed.
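The two formulas can be sketched in a few lines of Python. The function names are illustrative only and are not part of any Exchange tool. Floor division is used so that an offset in the middle of a page still maps to the page that contains it.

```python
PAGE_SIZE = 4096  # bytes per database page, as described above


def page_to_offset(logical_page: int) -> int:
    """Return the byte offset at which a logical page starts."""
    return (logical_page + 1) * PAGE_SIZE


def offset_to_page(byte_offset: int) -> int:
    """Return the logical page that contains the given byte offset."""
    return byte_offset // PAGE_SIZE - 1


print(page_to_offset(101))      # 417792
print(offset_to_page(4104192))  # 1001
```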
The logical page number is actually recorded on each page in the database. (In Exchange
Server 2003 with Service Pack 1 (SP1), the method for doing this has changed. For
more details, see "ECC Page Correction in Exchange Server 2003 SP1" later in this
document.) When Exchange reads a page, it checks whether the logical page number
matches the byte offset. If it does not match, a -1018 error results, and the page
is treated as unreadable.
The correspondence between physical and logical pages is important because it allows
Exchange to detect whether its pages have been stored in correct order in the database
files. If the physical location does not match the logical page number, the page
was written to the wrong place in the file system. Even if the data on the page
is correct, if the page is in the wrong place, Exchange will detect the problem
and not use the page.
Page Checksum
Along with the logical page number, each page in the database also stores a calculated
checksum for its data. The checksum is at the beginning of the page and is derived
by running an algorithm against the data on the page. This algorithm returns a 4-byte
checksum number. If something on a page changes, the checksum on the page will no
longer match the data on the page. (In Exchange Server 2003 SP1, the checksum algorithm
has become more complicated than this, as you will learn in the next section.)
Every time Exchange reads a page in the database, it runs the checksum algorithm
again and makes sure the result is the same as the checksum already on the page.
If it is not, something has changed on the page. A -1018 error is logged, and the
page is treated as unreadable.
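The read-time checks can be illustrated with a minimal Python sketch. It assumes the pre-SP1 layout in which the first four bytes of a page hold the checksum and the next four hold the page number. CRC-32 is only a stand-in, because this paper does not describe the actual checksum algorithm, and the little-endian byte order and all function names are assumptions made for the example.

```python
import struct
import zlib

PAGE_SIZE = 4096


def stand_in_checksum(body: bytes) -> int:
    # Assumption: CRC-32 stands in for the real (undocumented here) algorithm.
    return zlib.crc32(body) & 0xFFFFFFFF


def build_page(logical_page: int, payload: bytes) -> bytes:
    """Build a page: 4-byte checksum, 4-byte page number, then padded data."""
    body = struct.pack("<I", logical_page) + payload.ljust(PAGE_SIZE - 8, b"\0")
    return struct.pack("<I", stand_in_checksum(body)) + body


def verify_page(page: bytes, byte_offset: int) -> str:
    """Return 'ok', or the reason the page would be treated as unreadable."""
    stored_checksum, stored_page = struct.unpack_from("<II", page, 0)
    if stand_in_checksum(page[4:]) != stored_checksum:
        return "checksum mismatch"
    if stored_page != byte_offset // PAGE_SIZE - 1:
        return "page number mismatch"
    return "ok"


page = build_page(517, b"message data")
print(verify_page(page, 2121728))         # page 517 read from its own offset
print(verify_page(page, 2121728 + 4096))  # same page found one slot too late
```

In this sketch a page that fails either check is simply reported as bad. Nothing on it is trusted or used, which mirrors the behavior described above.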
ECC Page Correction in Exchange Server 2003 SP1
Exchange Server 2003 SP1 includes an important new recovery mechanism for some -1018
related damage. This mechanism is an Error Correction Code (ECC) checksum that is
placed on each page. This checksum is in addition to the checksum present in previous
versions of Exchange.
Each Exchange page now has two checksums, one right after the other, at the beginning
of each page. The first checksum (the data integrity checksum) determines whether
the page has been damaged; the second checksum (the ECC checksum) can be used to
automatically correct some kinds of random corruption. Before Exchange Server 2003
SP1, Exchange could reliably detect damage, but could not do anything about it.
By surveying many -1018 cases, Microsoft discovered that approximately 40 percent
of -1018 errors are caused by a bit flip. A bit flip occurs when a single bit on
a page has the wrong value—a bit that should be a 1 flips to 0, or vice versa. This
is a common error with computer disks and memory.
The ECC checksum can correct a bit flip. This means that approximately 40 percent
of -1018 errors are self-correcting if you are using Exchange Server 2003 SP1 or
later.
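To show how a second code can locate and repair a single flipped bit, here is a minimal Python sketch. It is not the algorithm Exchange uses, which this paper does not specify. It assumes a simple scheme: the correction code is the XOR of the 1-based positions of all set bits, and CRC-32 stands in for the integrity checksum. All names are hypothetical.

```python
import zlib


def position_xor(data: bytes) -> int:
    """XOR together the 1-based positions of every set bit in data."""
    code = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if (byte >> bit) & 1:
                code ^= byte_index * 8 + bit + 1
    return code


def protect(data: bytes) -> tuple[int, int]:
    """Return (integrity checksum, correction code) to store with the data."""
    return zlib.crc32(data), position_xor(data)


def read_with_correction(data: bytes, integrity: int, code: int) -> tuple[bytes, str]:
    """Verify data and try to repair one flipped bit if verification fails."""
    if zlib.crc32(data) == integrity:
        return data, "ok"
    # If exactly one bit flipped, the two codes differ by that bit's position.
    flipped = code ^ position_xor(data)
    if 1 <= flipped <= len(data) * 8:
        repaired = bytearray(data)
        repaired[(flipped - 1) // 8] ^= 1 << ((flipped - 1) % 8)
        if zlib.crc32(bytes(repaired)) == integrity:
            return bytes(repaired), "corrected"
    return data, "uncorrectable"


original = bytes(range(256)) * 16          # one 4,096-byte page of sample data
integrity, code = protect(original)
damaged = bytearray(original)
damaged[1234] ^= 0x10                      # simulate a single bit flip
print(read_with_correction(bytes(damaged), integrity, code)[1])
```

The integrity checksum is checked again after the repair attempt, so damage wider than one bit is reported as uncorrectable rather than silently "fixed".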
Note: ECC checksums that can correct multiple bit flips are possible, but not practical
to implement. Single-bit error correction has minimal performance overhead, but
it would be costly in terms of performance to detect and correct multiple bit errors.
As a statistical matter, the distribution of page errors tends to cluster in two
extremes: single bit errors and massive damage to the page.
If a -1018 error is corrected by the ECC mechanism, it does not mean you can safely
ignore the error. ECC correction does not change the fact that the underlying platform
did not reliably store or retrieve data. ECC makes recovery from error -1018 automatic
(40 percent of the time), but does not change anything else about the way you should
respond to a -1018 error.
The format of Exchange database page headers had to be changed to accommodate the
ECC checksum. The field in each page header that used to carry the logical page
number now carries the page number mixed with the ECC checksum. This means that
Exchange Server 2003 SP1 databases are not backward compatible, even with the Exchange
Server 2003 original release. The same applies to database tools, such as Eseutil.
With older versions of the tools, the ECC databases appear to be massively corrupt,
because the ECC checksum is not considered.
For more information about ECC page correction, refer to the Microsoft Knowledge
Base article "New error correcting code is included in Exchange Server 2003 SP1".
Backup and Error -1018
A -1018 error may be encountered at any time while the database is running. However,
this is not how the majority of -1018 problems are actually discovered. Instead,
they are more often found during backup.
A -1018 error is reported only when a page is read, and not all pages in the database
are likely to be read frequently. For example, messages in a user's Deleted Items
folder may not be accessed for long periods. A -1018 error in such a location could
go undetected for a long time. To detect -1018 problems quickly, you must read all
the pages in the databases. Online backup is a natural opportunity for checking
the entire database for -1018 damage, because to back up the whole database you
have to read the whole database.
Exchange Online Streaming API Backups
Exchange has always supported an online streaming backup application programming
interface (API) that allows Exchange databases to be backed up while they are running.
Many third-party vendors have created Exchange-aware backup modules or agents that
use this API. Backup, the backup program that comes with Microsoft Windows Server™
2003 or Windows® 2000 Server, supports the Exchange streaming backup API. If
you install Exchange Server or Exchange administrator programs on a computer, Backup
is automatically enabled for Exchange-aware online backups.
If a -1018 page is encountered during online backup, the backup will be stopped.
Exchange will not allow you to complete an online backup of a database with -1018
damage. This is to ensure that your backup can never have a -1018 problem in it.
This is important because it means you can recover from a -1018 problem by restoring
from your last backup and bringing the database up-to-date with the subsequent transaction
log files. After you do this, you will have a database that is up-to-date, with
no data loss, and with no -1018 pages.
Playing transaction logs will never introduce a -1018 error into a database. However,
playing transaction logs may uncover an already existing -1018 error. To apply transaction
log data, Exchange must read each destination page in the database. If a destination
page is damaged, transaction log replay will fail. Exchange cannot replace a page
with what is in the transaction log because transaction log updates may be done
for only parts of a page.
If you restore from an online backup and encounter a -1018 error during transaction
log replay, the most likely reason is that corruption was introduced into the database
by hardware instability during or after restoration. To test this, restore the same
backup to known good hardware. For more information, see "Can Exchange Cause a -1018
Error?" later in this document.
Restoring from an online backup and replaying subsequent transaction logs is the
standard strategy for recovering from -1018 errors. Other strategies for special
circumstances are outlined in "Recovering from a -1018 Error" later in this document.
Backup Retries and Transient -1018 Errors
Not all -1018 errors are permanent. A -1018 error may be reported because of a failure
in memory or in a subsystem other than the disk. The database page on the disk is
good, but the system does not read the disk reliably. To handle such cases, and
to give the backup a better chance to succeed even on failing hardware, Exchange
has functionality to retry -1018 errors encountered during backup.
If a -1018 error is reported when a page is backed up, Exchange will wait a second
or two, and then try again to read the page. This will happen up to 16 times before
Exchange gives up, fails the read of the page, and then fails the backup.
If Exchange eventually reads the page successfully, the copy of the page on the
disk is good, but there is a serious problem elsewhere in the system. Even if Exchange
is not successful in reading the page, it does not prove conclusively that the page
is bad. Depending on how hardware caching has been implemented, all 16 read attempts
may come from the same cache rather than directly from the disk. Exchange waits
between each read attempt and tries to read again directly from the disk to increase
the likelihood that the read will not be satisfied from cache.
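The retry behavior can be sketched as a small loop. This is an illustration in Python, not Exchange code: `read_page` and `is_valid` are hypothetical callables supplied by the caller, and the one-second default wait is an assumption based on the "second or two" mentioned above.

```python
import time
from typing import Callable, Optional

MAX_READ_ATTEMPTS = 16  # the number of attempts given above


def read_page_for_backup(
    read_page: Callable[[], bytes],
    is_valid: Callable[[bytes], bool],
    wait_seconds: float = 1.0,
) -> Optional[bytes]:
    """Return a verified page, or None if every attempt fails verification."""
    for attempt in range(1, MAX_READ_ATTEMPTS + 1):
        page = read_page()
        if is_valid(page):
            return page
        if attempt < MAX_READ_ATTEMPTS:
            time.sleep(wait_seconds)  # pause before reading again
    return None  # the caller would fail the backup at this point
```

A page that is returned only after one or more failed attempts still points to a problem somewhere in the read path, so a caller of such a function would want to log the failed attempts as well as the final result.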
Exchange Volume Shadow Copy Service API Online Backups
If you are running Exchange Server 2003 on Windows Server 2003, you have the additional
online backup option of performing Volume Shadow Copy service-based online backups
of Exchange. The Volume Shadow Copy service online backup API is a new method that
is similar in its capabilities to the streaming backup API, but that can allow for
faster restoration times independent of the database file size. How fast Volume
Shadow Copy service backup is compared to streaming backup depends on a number of
factors, the most important of which is whether the Volume Shadow Copy provider
is software-based or hardware-based. Both software-based and hardware-based providers
can make snapshot and clone copies of files even when the files are locked open
and in use. However, if you use a software provider, the process is no faster than
when making an ordinary file copy. To make the snapshot or clone process almost
instantaneous, even for very large files, you must use a hardware provider.
Backup for Windows 2003 includes a software-based generic Volume Shadow Copy service
provider, but does not support Exchange-aware Volume Shadow Copy service backups.
If you are using any version of Backup for Windows as your Exchange backup application,
you must perform streaming API online backups.
An Exchange-aware Volume Shadow Copy service backup must complete in less than 20
seconds. This is because Exchange suspends changes to the database files during
the backup. If the snapshot or clone does not complete within 20 seconds, the backup
fails. Thus, a hardware provider is required because the backup must complete so
quickly.
Exchange has no opportunity to read database pages during a Volume Shadow Copy service
backup. Therefore, the database cannot be checked for -1018 problems during backup.
If you use a Volume Shadow Copy service-based Exchange backup solution, the vendor
must verify the integrity of the backup in a separate operation soon after the backup
has finished.
For more information about Volume Shadow Copy service backup and Exchange, see the
Microsoft Knowledge Base article "Exchange Server 2003 data backup and Volume Shadow
Copy services".
Application Log Event IDs
When a -1018 error occurs, you will not see a -1018 event in the application log.
Instead, there are several different events that will report the -1018 as part of
their Description fields. Which event is logged depends on the circumstances under
which the -1018 problem was detected.
This listing of events associated with error -1018 is not comprehensive, but it
does include the core events for which you should monitor.
For all versions of Exchange, Microsoft Operations Manager (MOM) monitors for events
474, 475, and 476 from the event source Extensible Storage Engine (ESE). If you
are running Exchange Server 2003 SP1, you should also ensure that event 399 is monitored.
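As a rough illustration of the monitoring rule, the following Python sketch filters event records by source and event ID. The dictionary shape with `source` and `event_id` keys is a hypothetical representation chosen for the example. It is not the format used by MOM or by the Windows event log APIs.

```python
# Event IDs from the ESE source that the text above says to monitor.
WATCHED_ESE_EVENTS = {474, 475, 476, 399}


def needs_attention(event: dict) -> bool:
    """Return True if an event record matches the -1018 monitoring rule."""
    return event.get("source") == "ESE" and event.get("event_id") in WATCHED_ESE_EVENTS


sample_events = [
    {"source": "ESE", "event_id": 474},
    {"source": "ESE", "event_id": 221},
    {"source": "ESE", "event_id": 399},
]
print([e["event_id"] for e in sample_events if needs_attention(e)])  # [474, 399]
```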
Event 474
For versions of Exchange prior to Exchange Server 2003 SP1, event 474 is logged
when any checksum discrepancy is detected. For Exchange Server 2003 SP1, this event
is logged only when multiple bit errors exist on a page. If a single bit error is
detected, event 399 (discussed later in this document) is logged instead.
Here is an example of a typical event 474:
Event Type: Error
Event Source: ESE
Event ID: 474
Description: Information Store (3500) First Storage Group: The database page read
from the file "C:\mdbdata\priv1.edb" at offset 2121728 (0x0000000000206000) for
4096 (0x00001000) bytes failed verification due to a page checksum mismatch. The
expected checksum was 1848886333 (0x6e33c43d) and the actual checksum was 1848886845
(0x6e33c63d). The read operation will fail with error -1018 (0xfffffc06). If this
condition persists then please restore the database from a previous backup. This
problem is likely due to faulty hardware. Please contact your hardware vendor for
further assistance diagnosing the problem.
The Description field of this event provides information that can be useful for
advanced troubleshooting and analysis. You should always preserve this information
after a -1018 error has been reported. Providing this information to hardware vendors
or to Microsoft Product Support Services may be helpful when troubleshooting multiple
-1018 errors.
The Description field shows which database has been damaged and where the damage
occurred. For translating a byte offset to a logical page number, recall the formula
described in "Page Ordering" earlier in this document. Using that formula, you know
that the page damaged in this error is logical page 517 because (2121728 ÷ 4096)
- 1 = 517. Direct analysis of the page may show patterns that will help a hardware
vendor determine the problem that caused the damage.
The description also lists the checksum that is written on the page as the expected
checksum: 6e33c43d. The actual checksum is the checksum that Exchange calculates
again as it reads the page: 6e33c63d.
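When many events have to be reviewed, the same details can be pulled out of the Description text programmatically. The Python sketch below assumes the wording shown in the sample event 474 above. The regular expression and the function name are illustrative, and other Exchange versions or locales may word the event differently.

```python
import re

PAGE_SIZE = 4096

# Assumption: the Description follows the wording of the sample event above.
EVENT_474 = re.compile(
    r'from the file "(?P<file>[^"]+)" at offset (?P<offset>\d+) .*?'
    r"expected checksum was \d+ \(0x(?P<expected>[0-9a-fA-F]+)\) and the "
    r"actual checksum was \d+ \(0x(?P<actual>[0-9a-fA-F]+)\)"
)


def parse_event_474(description: str) -> dict:
    """Extract file, offset, logical page, and checksums from a Description."""
    text = " ".join(description.split())  # undo line wrapping
    match = EVENT_474.search(text)
    if match is None:
        raise ValueError("not a recognizable event 474 description")
    offset = int(match["offset"])
    return {
        "file": match["file"],
        "offset": offset,
        "logical_page": offset // PAGE_SIZE - 1,
        "expected": int(match["expected"], 16),
        "actual": int(match["actual"], 16),
    }
```

Keeping the parsed values alongside the raw event text makes it easier to compare several -1018 events for recurring files, offsets, or checksum patterns.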
Why does it help to know what the checksum values are? Patterns in the checksum
differences may assist in advanced troubleshooting. For an example of this, see
"Appendix A: Case Studies" later in this document.
In addition, you can tell whether a particular -1018 error is the result of a single
bit error (bit flip) by comparing the expected and actual checksums. To do this,
translate the checksums to their binary numbering equivalents. If the checksums
are identical except for a single bit, the error on the page was caused by a bit
flip.
The checksums listed in the preceding example can be translated to their binary
equivalents using Calc.exe in its scientific mode:
0x6e33c43d = 1101110001100111100010000111101
0x6e33c63d = 1101110001100111100011000111101
Single bit difference             ^
In the preceding example, if this error had occurred on an Exchange Server 2003
SP1 database, the error would have been automatically corrected.
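The same comparison can be done without converting to binary by hand. The Python sketch below XORs the two checksums and lists the bit positions at which they differ. The function names are illustrative.

```python
def differing_bits(expected: int, actual: int) -> list[int]:
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


def is_single_bit_difference(expected: int, actual: int) -> bool:
    """Return True if the two checksums differ in exactly one bit."""
    return len(differing_bits(expected, actual)) == 1


print(differing_bits(0x6E33C43D, 0x6E33C63D))            # [9]
print(is_single_bit_difference(0x6E33C43D, 0x6E33C63D))  # True
```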
In Exchange Server 2003 SP1, the checksum reported in the Description field of event
474 shows the page integrity checksum and the ECC checksum together. For example:
Description: Information Store (3000) SG1018: The database page read from the file
"D:\exchsrvr\SG1018\priv1.edb" at offset 2371584 (0x0000000000243000) for 4096 (0x00001000)
bytes failed verification due to a page checksum mismatch. The expected checksum
was 2484937984258 (0x0000024291d88902) and the actual checksum was 62488400759392765
(0x00de00de91d889fd). The read operation will fail with error -1018 (0xfffffc06).
If this condition persists then please restore the database from a previous backup.
This problem is likely due to faulty hardware. Please contact your hardware vendor
for further assistance diagnosing the problem.
Notice that the checksum listed is 16 hexadecimal characters, and in the previous
example, the checksum is eight hexadecimal characters. In the new checksum format,
the first eight characters are the ECC checksum, and the last eight characters are
the page integrity checksum.
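Splitting the 16-character value into its two halves is a simple shift and mask, as this Python sketch shows. The function name is illustrative, and the values are taken from the sample event above.

```python
def split_sp1_checksum(value: int) -> tuple[int, int]:
    """Split a reported 64-bit value into (ECC checksum, page integrity checksum)."""
    return value >> 32, value & 0xFFFFFFFF


ecc, integrity = split_sp1_checksum(0x0000024291D88902)
print(f"{ecc:08x} {integrity:08x}")  # 00000242 91d88902
```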
Event 475
Event 475 indicates a -1018 problem caused by a wrong page number. It is no longer
used in Exchange Server 2003. Instead, bad checksums and wrong page numbers are
reported together under Event 474. The following is an example of event 475:
Event Type: Error
Event Source: ESE
Event ID: 475
Description: Information Store (1448) The database page read from the file "C:\MDBDATA\priv1.edb"
at offset 1257906176 (0x000000004afa2000) for 4096 (0x00001000) bytes failed verification
due to a page number mismatch. The expected page number was 307105 (0x0004afa1)
and the actual page number was 307041 (0x0004afe1). The read operation will fail
with error -1018 (0xfffffc06). If this condition persists then please restore the
database from a previous backup.
Event 475 can be misleading. It may not mean the page is in the wrong location in
the database. It only indicates that the page number field is wrong. Only if the
checksum on the page is also valid can you conclude that the page is in the wrong
location. Advanced analysis of the actual page is required to determine whether
the field is corrupted or the page is in the wrong place. In the majority of cases,
the page field is corrupted.
Notice that in the preceding example, the difference in the page number fields is
a single bit, indicating that this page is probably in the right place, but was
damaged by a bit flip.
Event 476
Event 476 indicates error -1019 (JET_errPageNotInitialized). This error will occur if
a page in the database is expected to be in use, but the page number is zero.
In releases of Exchange prior to Exchange 2003 Service Pack 1, the first four bytes
of each page store the checksum, and the next four bytes store the page number.
If the page number field is all zeroes, then the page is considered uninitialized.
To make room for the ECC checksum in Exchange 2003 Service Pack 1, the page number
field has been converted to the ECC checksum field. The page number is now calculated
as part of the checksum data, and a page is now considered to be uninitialized if
both the original checksum and ECC checksum fields are zeroed.
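The two definitions of an uninitialized page can be contrasted in a short Python sketch. It assumes only what is stated above: two consecutive 4-byte fields at the start of the page, whose meaning depends on the page format. The function name and the `sp1_format` flag are hypothetical.

```python
import struct


def is_uninitialized(page: bytes, sp1_format: bool) -> bool:
    """Apply the uninitialized-page rule for the given page format."""
    first_field, second_field = struct.unpack_from("<II", page, 0)
    if sp1_format:
        # SP1 and later: both the checksum and ECC checksum fields are zeroed.
        return first_field == 0 and second_field == 0
    # Before SP1: the second field is the page number, and it is zero.
    return second_field == 0
```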
Event Type: Error
Event Source: ESE
Event ID: 476
Description: Information Store (3500) First Storage Group: The database page read
from the file "C:\mdbdata\priv1.edb" at offset 2121728 (0x0000000000206000) for
4096 (0x00001000) bytes failed verification because it contains no page data. The
read operation will fail with error -1019 (0xfffffc05). If this condition persists
then please restore the database from a previous backup. This problem is likely
due to faulty hardware. Please contact your hardware vendor for further assistance
diagnosing the problem.
In most cases, error -1019 is just a special case of error -1018. However, it could
also be that a logical problem in the database has caused a table to show that an
empty page is in use. Because you cannot distinguish between these two cases without
advanced logical analysis of the entire database, error -1019 is reported instead of
error -1018.
Error -1019 is rare, and a full discussion of analyzing and troubleshooting this error
is outside the scope of this paper.
Event 399
Event 399 is a new event that was added in Exchange Server 2003 SP1. It is a Warning
event, and not an Error event. It indicates that a single bit corruption has been
detected and corrected in the database.
Event Type: Warning
Event Source: ESE
Event ID: 399
Description: Information Store (3000) First Storage Group: The database page read
from the file "C:\mdbdata\priv1.edb" at offset 4980736 (0x00000000004c0000) for
4096 (0x00001000) bytes failed verification. Bit 144 was corrupted and has been
corrected. This problem is likely due to faulty hardware and may continue. Transient
failures such as these can be a precursor to a catastrophic failure in the storage
subsystem containing this file. Please contact your hardware vendor for further
assistance diagnosing the problem.
Although Event 399 is a warning rather than an error, it should be monitored for
and treated as seriously as any uncorrectable -1018 error. All -1018 errors indicate
platform instability of one degree or another and may indicate additional errors
will occur in the future.
Event 217
Event 217 indicates backup failure because of a -1018 error.
Event Type: Error
Event Source: ESE
Event ID: 217
Description: Information Store (1224) First Storage Group: Error (-1018) during backup
of a database (file C:\mdbdata\priv1.edb). The database will be unable to restore.
Immediately before this error occurs, you will typically find a series of 16 event
474 errors in the application log, all for the same page. During backup, Exchange
will retry a page read 16 times, waiting a second or two between each attempt. This
is done in case the error is transient, so that a backup has a better chance to
succeed.
Retries are not done for normal run-time read errors, but only during backup. Performing
retries during normal operation could stall the database, if a frequently accessed
page is involved.
Event 221
Event 221 indicates backup success. It is generated for each database file individually
when it is backed up.
Event Type: Information
Event Source: ESE
Event ID: 221
Description: Information Store (1224) First Storage Group: Ending the backup of the
file C:\mdbdata\priv1.edb.
----------
Event Type: Information
Event Source: ESE
Event ID: 221
Description: Information Store (1224) First Storage Group: Ending the backup of the
file D:\mdbdata\priv1.stm.
If you are using third-party backup applications, there may be additional backup
events that you should monitor in addition to those listed here.
Root Causes
At the simplest level, there are only three root causes for -1018 errors:
- The underlying platform for your Exchange database has failed to reliably write
Exchange data to storage.
- The underlying platform for your Exchange database has failed to reliably read Exchange
data from storage.
- The underlying platform for your Exchange database has failed to reliably preserve
Exchange data while in storage.
This level of analysis defines the scope of the issue. At a practical level, you
want to know:
How Microsoft IT assesses the likelihood of additional errors is described in "Server
Assessment and Root Cause Analysis" later in this document. Recovery strategies
are also described later in this document. This section summarizes the most common
root causes for error -1018:
- Failing disk drives. Along with simple drive failures, it is not uncommon
for Microsoft Product Support Services to handle cases where rebuilding a redundant
array of independent disks (RAID) drive set after a drive failure is not successful.
- Hard failures. Sudden interruption of power to the server or the disk subsystem
may result in corruption or loss of recently changed data. Enterprise class server
and storage systems should be able to handle sudden loss of power without corruption
of data. Microsoft has tested Exchange and Exchange servers by unplugging a test
server thousands of times in succession, with no corruption of Exchange data afterward.
Exchange is an application that is well suited to uncovering problems from such
testing because of its transaction log replay behavior and its checksum function.
Damage to Exchange files often becomes evident during post-failure transaction log
replay and recovery, or through verifying checksums on the database files after
a test pass.
For more information about input/output (I/O) atomicity, and its importance for
data integrity after a hard failure, refer to "Best Practices" later in this document.
-
Cluster failovers. As an application is transitioned from one cluster node
to another, disk I/O may be lost or not properly queued during the transition. Even
though individual components may be robust and well designed, they may not work
well together as a cluster system. This is one reason that Microsoft has a qualification
program for cluster systems that is separate from the qualification program for
stand-alone components. The cluster system qualification program tests all critical
components together rather than separately.
-
Resets and other events in the disk subsystem. Companies are increasingly
implementing Storage Area Network (SAN) and other centralized storage technologies,
in which multiple servers access a shared storage frame. Not only are correct configuration
and isolation of disk resources essential in these environments, but you must also
manage redundant I/O paths and an increasing number of filters and services that
are involved in disk I/O. The increasing complexity of the I/O chain necessarily
introduces additional points of failure and exposes poor product integration.
-
Hardware or firmware bugs. Standard diagnostic test runs are seldom successful
in diagnosing these problems. (If the standard diagnostic run could catch this particular
problem, would it not already have been caught?) Understanding these issues frequently
requires correlating data from multiple servers and using specialized diagnostic
suites and stress test harnesses.
This is not a comprehensive list of all causes of error -1018, but it does outline
the problem categories that account for the majority of these errors.
Can Exchange Cause a -1018 Error?
Can Exchange be the root cause of a -1018 error? Exchange might be responsible for
creating a -1018 condition if it did one or both of the following:
-
Constructed the wrong checksum for a page.
-
Constructed a page correctly, but instructed the operating system to write the page
in the wrong location.
The Exchange mechanisms for generating checksums and writing pages back to the database
files are based on simple algorithms that have been stable since the first Exchange
release. Even the addition of the ECC checksum in Exchange Server 2003 SP1 did not
fundamentally alter the page integrity checksum mechanism. The ECC checksum is an
additional checksum placed next to the original corruption detection checksum. The
integrity of Exchange database pages is still verified through the original checksum.
Note: If you use versions of Esefile or Eseutil from versions of Exchange prior to
Exchange Server 2003 SP1 to verify checksums in an Exchange Server 2003 SP1 or later
database, nearly every page of the database will be reported as damaged. The page
format was altered in Exchange Server 2003 SP1 and previous tools cannot read the
page correctly. You must use post-Exchange Server 2003 SP1 tools to verify ECC checksum
pages.
A logical error in the page integrity checksum mechanism would likely result in
reports of massive and immediate corruption of the database, rather than in infrequent
and seemingly random page errors.
This does not mean that there have never been any problems in Exchange that have
resulted in logical data corruption. However, these problems cause different errors
and not a -1018 error. Error -1018 is deliberately scoped to detect logically random
corruptions caused by the underlying platform.
There are a few cases where false positive or false negative -1018 reports have
been caused by a problem in Exchange. In these cases, the checksum mechanism worked
correctly, but there was a problem in a code path for reporting an error. This caused
a -1018 error to be reported when there was no problem, or an error to not be reported
that should have been. Examination of the affected databases quickly leads to resolution
of such issues.
Transaction log file replay is another Exchange capability that allows
Microsoft to effectively diagnose -1018 errors that may be the fault of Exchange.
Recall from the previous section that online backups are not allowed to complete
if -1018 problems exist in the database. In addition, after restoration of a backup,
transaction log replay re-creates every change that happened subsequent to the backup.
This allows Exchange development to start from a known good copy of the database
and trace every change to it.
As an Exchange administrator, you should look more closely at Exchange as the
possible cause of a -1018 error if you see either of the following two symptoms:
-
After restoration of an online backup, and before transaction log file replay begins,
there is a -1018 error in the restored database files. This could indicate that
checksum verification failed to work correctly during backup. It is also possible
that the backup media has gone bad, or that data was damaged while being restored
because of failing hardware. The next test is more conclusive.
-
After checksum verification of restored databases, a -1018 error is present after
successful transaction log replay has completed. This could indicate that a logical
problem resulted in generation of an incorrect checksum. Reproducing this problem
consistently on different hardware will rule out the possibility that failing hardware
further damaged the files during the restoration and replay process.
Conversely, if restoring from the backup and rolling forward the log files eliminate
a -1018 error, this is a strong indication that damage to the database was caused
by an external problem.
In summary, error -1018 is scoped to report only two specific types of data corruption:
-
A logical page number recorded on a page is nonzero and does not match the physical
location of the page in the database.
-
The checksum recorded on a page does not match the actual data recorded on the page.
Exchange thus detects both corruption of the data on a page and guards against the
possibility that a page in the database has been written in the wrong place.
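The two checks summarized above can be illustrated with a toy model. The following Python sketch is not the Exchange page format or checksum algorithm. The 8-byte header layout, the use of CRC-32 as a stand-in checksum, and the names `make_page` and `check_page` are all assumptions made for illustration. Only the two test conditions are taken from this paper.

```python
import struct
import zlib

PAGE_SIZE = 4096                 # toy page size for this sketch
HEADER = struct.Struct("<II")    # toy header: checksum, logical page number


def make_page(page_number, payload):
    """Build a toy page whose checksum covers everything after it."""
    body = payload.ljust(PAGE_SIZE - HEADER.size, b"\0")
    rest = struct.pack("<I", page_number) + body
    return struct.pack("<I", zlib.crc32(rest)) + rest


def check_page(raw, physical_page_number):
    """Return None if the page passes both checks, else a reason string."""
    stored_checksum, logical_page_number = HEADER.unpack_from(raw)
    # Check 1: a nonzero recorded page number must match the location.
    if logical_page_number != 0 and logical_page_number != physical_page_number:
        return "wrong location"
    # Check 2: the recorded checksum must match the data on the page.
    if zlib.crc32(raw[4:]) != stored_checksum:
        return "checksum mismatch"
    return None
```

In this model, a page read back from slot 8 that records page number 7 fails the first check, and a page with a single flipped bit fails the second.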
How Microsoft IT Responds to Error -1018
Microsoft IT uses Microsoft Operations Manager (MOM) 2005 to monitor the health
and performance of Microsoft Exchange servers. MOM sends alerts to operator consoles
for critical errors, including error -1018.
MOM provides enterprise-class operations management to improve the efficiency of
IT operations. You can learn more about MOM at the
Microsoft Windows Server System Web site.
At Microsoft, automatic e-mail notifications are sent to a select group of hardware
analysts whenever a -1018 occurs. Thus, all -1018 errors are investigated by an
experienced group of people who track the errors over time and across all servers.
As you will see later in this document, this approach is an important part of the
methodology at Microsoft for handling -1018 errors.
Monitoring Backup Success
Every organization, regardless of size, should monitor Exchange servers for error
-1018. The most basic way to accomplish this, if your organization does not use
a monitoring application such as MOM, is to verify the success of each Exchange
online backup. Even if you do use MOM, you should still monitor backup success separately.
If Exchange online backups are failing unnoticed, you are at risk on at least these
counts:
-
A common reason for backup failure is that the database has been damaged.
Thus, the Exchange platform may be at risk of sudden failure.
-
You do not have a recent known good backup of critical Exchange data.
While an older backup can be rolled forward with additional transaction logs for
zero loss restoration, the older the backup, the less likely this will be successful,
for a number of operational reasons. For example, an older backup tape may be inadvertently
recycled. In addition, if the platform issues on the Exchange server result in loss
of the transaction logs, rolling forward will be impossible.
-
After successful completion of an online backup, excess transaction logs are
automatically removed from the disk. With backups not completing, transaction
log files will remain on the disk, and you are at risk for eventually running out
of disk space on the transaction log drive. This will force dismount of all databases
in the affected storage group. (If a transaction log drive becomes full, do not
simply delete all the log files. Instead, refer to the Microsoft Knowledge Base
article "How to Tell Which Transaction Log Files Can Be Safely Removed".)
Verifying backup success is arguably the single most important monitoring task for
an Exchange administrator.
As a best practice, Microsoft IT not only sets notifications and alerts for backup
errors and failures, but also for backup successes. A daily report for each database
is generated and reviewed by management. This review ensures that there is positive
confirmation that each database has actually been backed up recently, and that there
is immediate attention to each backup failure.
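The idea of positive confirmation can be sketched in a few lines. The following Python example is a hypothetical illustration, not the report that Microsoft IT uses. The function name `backup_report`, the input mapping, and the 24-hour freshness threshold are assumptions. The point is that a database with no recorded success is flagged just as loudly as one with a recorded failure.

```python
from datetime import datetime, timedelta


def backup_report(last_success, now, max_age=timedelta(hours=24)):
    """Classify each database by the age of its last successful backup.

    last_success maps a database name to the datetime of its most recent
    successful backup, or to None if no success has been recorded.
    The 24-hour default is an assumed threshold, not a documented one.
    """
    report = {}
    for database, finished in sorted(last_success.items()):
        fresh = finished is not None and now - finished <= max_age
        report[database] = "ok" if fresh else "STALE"
    return report
```

A database marked STALE needs attention even if no backup error was ever logged for it.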
Securing Data after a -1018 Error
The most common way that a -1018 error comes to the attention of Microsoft IT analysts
is through a backup failure. While a -1018 error may occur during normal database
operation, normal run-time -1018 errors are less frequent than errors during backup.
Note: Exchange databases perform several self-maintenance tasks on a regular schedule
(which can be set by the administrator). One of these tasks, called online defragmentation,
consolidates and moves pages within the database for better efficiency. Thus, error
-1018 may be reported more frequently during the online maintenance window than
during normal run time.
This is the general process that occurs at Microsoft after a -1018 error:
-
MOM alerts are generated and e-mail notification is sent to Exchange analysts.
-
Analysts verify that recent good backups exist for all databases on the server.
It is important that backups are good for all databases, and not just the database
affected by the -1018, because the error indicates that the entire server is at
risk.
-
All transaction log files on the server are copied to a remote location, in case
there is a failure of the transaction log drive. As the investigation proceeds,
new log files are periodically copied to a safe location or backed up incrementally.
You can copy log files to a safe location by doing an incremental or differential
online backup. In Exchange backup terminology, an incremental or differential backup
is one that backs up only transaction log files and not database files. An incremental
backup removes logs from the disk after copying them to the backup media. A differential
backup leaves logs on the disk after copying them to the backup media.
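Where a plain file copy is used instead of an incremental or differential backup, the periodic copy step might look like the following Python sketch. The function name `copy_new_logs` and the directory arguments are hypothetical. The sketch copies and never moves or deletes, and it assumes that the log files are closed and readable. A log file that the running store still holds open may not be copyable this way, which is one reason the online backup route described above is preferable.

```python
import shutil
from pathlib import Path


def copy_new_logs(log_dir, safe_dir):
    """Copy *.log files that are not yet present in safe_dir.

    Files are copied, never moved or deleted, so the originals remain
    available if the copy itself is damaged in transit.
    """
    log_dir, safe_dir = Path(log_dir), Path(safe_dir)
    safe_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(log_dir.glob("*.log")):
        target = safe_dir / source.name
        if target.exists() and target.stat().st_size == source.stat().st_size:
            continue  # already secured on an earlier pass
        shutil.copy2(source, target)
        copied.append(source.name)
    return copied
```

Running the function again later picks up only the log files created since the previous pass.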
After existing Exchange data has been verified to be recoverable and safe, it is
time to begin assessing the server and performing root cause analysis.
Server Assessment and Root Cause Analysis
There are two levels at which you must gauge the seriousness of a -1018 error:
These two factors are independent of each other. Ignoring a -1018 error because
the damaged page is not an important one is a mistake. The next page destroyed may
be critical and may result in a sudden catastrophic failure of the database.
There are two common analysis and recovery scenarios for a -1018 condition:
-
There is only a single error, and little or no immediate impact on the overall functioning
of the server. You have time to do careful diagnosis, and plan and schedule a low-impact
recovery strategy. However, root cause analysis is likely to be difficult because
the server is not showing obvious signs of failure beyond the presence of the error.
-
There are multiple damaged pages or the error occurs in conjunction with other significant
failures on the server. You are in an emergency recovery situation.
In the majority of emergency recovery situations, root cause analysis is simple
because there is a strong likelihood that the -1018 was caused by a catastrophic
or obvious hardware failure. Even in an emergency situation, you should take the
time to preserve basic information about the error that is needed for statistical
trending across servers. For more information, see "Appendix B: -1018 Record Keeping"
later in this document.
Even before root cause analysis, your first priority should be to make sure that
existing data has already been backed up and that current transaction log files
are protected. Then you can begin analysis with bookending.
Bookending
The point at which a page was actually damaged and the point at which a -1018 was
reported may be far apart in time. This is because a -1018 error will only be reported
when a page is read by the database. Bookending is the process of bracketing the
range of time in which the damage must have occurred.
The beginning bookend is the time of the last good online backup of the database
(marked by event 221). Because the entire database is checked for -1018 problems
during backup, you know that the problem did not occur until after the backup occurred.
The other bookend is the time at which the -1018 error was actually reported in
the application log. Frequently, this will be a backup failure error (event 217).
The event that caused the -1018 error must have occurred between these two points
in time.
After you have established your bookends, the next task is to look for what else
happened on the server during this time that may be responsible for the -1018 error:
-
Was there a hard server or disk failure?
-
Was the server restarted (event 6008 in the system log)?
-
Were Exchange services stopped and restarted?
-
Have there been any storage-related errors? This includes memory, disk, multipath,
and controller errors. Not only should you search the Windows system log, but you
should also be aware of other logging mechanisms that may be used by the disk system.
Many external storage controllers do not log events to the Windows system or application
logs, and, by default, the controller may not be set up to log errors. You must
ensure that error logging is enabled and that you can locate and interpret the logs.
-
Did Chkdsk run against any of the volumes holding Exchange data?
-
If this is a clustered server, were there any failovers or other cluster state transitions?
-
Have any hardware changes been made, or has firmware or software been upgraded on
the server?
Any unusual event that occurred between the bookend times must be considered suspect.
If there are no unusual events that can account for damage to the database files,
you must consider the possibility that there is an undetected problem with the reliability
of the underlying platform.
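Bookending amounts to filtering a timeline. The following Python sketch assumes that events from the Windows logs and any storage controller logs have already been collected into simple tuples. The tuple layout and the function name `events_between_bookends` are assumptions made for illustration.

```python
from datetime import datetime


def events_between_bookends(events, last_good_backup, error_reported):
    """Return the events that fall inside the bookend window, oldest first.

    events is an iterable of (timestamp, source, description) tuples.
    last_good_backup is the time of the last event 221 for the database.
    error_reported is the time the -1018 error was logged.
    """
    if last_good_backup > error_reported:
        raise ValueError("the last good backup must precede the error report")
    suspects = [e for e in events if last_good_backup <= e[0] <= error_reported]
    return sorted(suspects, key=lambda e: e[0])
```

Every event the function returns is a candidate cause. An empty result points toward an undetected platform problem or a transient condition.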
It is also possible that the error is due to a transient condition external to the
hardware. A variety of environmental factors can corrupt computer data or cause
transient malfunctions. Vibration, heat, power fluctuations, and even cosmic rays
are known to cause glitches or even permanent damage. Hard drive manufacturers are
well aware that normally functioning drives are not 100 percent reliable in their
ability to read, write, and store data, and design their systems to detect errors
and rewrite corrupted data.
Keeping in mind that no computer storage system is 100 percent reliable, how can
you decide whether a -1018 is indicative of an underlying problem that you should
address, or is just a random occurrence that you should accept?
A Microsoft Senior Storage Technologist who has extensive experience in root cause
analysis of disk failures and Exchange -1018 errors suggests this principle: For
100 Exchange servers running on similar hardware, you should experience no more
than a single -1018 error in a year. The phrase "running on similar hardware" is important
in understanding the proper application of this principle.
Standardizing on a single hardware platform for Exchange is useful in root cause
analysis of -1018 errors. In the absence of an obvious root cause, the next step
of investigation is to look for patterns of errors across similar servers.
A single -1018 error on a single page may be a random event. Only after another
-1018 error occurs on the same or a similar server do you have enough information
to begin looking for a trend or common cause. If a -1018 error occurs on two servers
that have nothing in common, you have two errors that have nothing in common rather
than two data points that may reveal a pattern.
As a general rule, if you average fewer than one -1018 error across 100 servers of
the same type per year, it is unlikely that root cause analysis will reveal an actionable
problem.
This does not mean that you should not record data for each -1018 error that occurs
on a server. Until a second error has occurred, you cannot know whether a particular
error falls below the threshold of this principle.
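To compare an observed error history against this principle, the error count can be normalized to 100 servers over one year. The following Python sketch is a simple worked calculation. The function name and the use of a 365-day year are assumptions, and the result is meaningful only when all of the servers counted run on similar hardware.

```python
def errors_per_100_server_years(error_count, server_count, days_observed):
    """Normalize a -1018 error count to 100 servers over one year."""
    server_years = server_count * days_observed / 365.0
    if server_years <= 0:
        raise ValueError("server_count and days_observed must be positive")
    return 100.0 * error_count / server_years
```

For example, three errors in one year across 50 similar servers normalize to six errors per 100 servers per year, well above the threshold of one. One error in a year across 100 similar servers normalizes to exactly one.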
If a -1018 error is caused by a subtle hardware problem, providing data from multiple
errors can be critical. With only a single error to consider, it is likely to be
difficult for Microsoft or a hardware vendor to identify a root cause beyond what
you can identify on your own. For two actual -1018 root cause investigations, and
examples of how difficult and subtle some issues can be to analyze, see "Appendix
A: Case Studies" later in this document.
Detailed information about every -1018 error that happens at Microsoft is logged
into a spreadsheet as described in "Appendix B: -1018 Record Keeping" later in
this document.
Verifying the Extent of Damage
Error -1018 applies to problems on individual pages in the database, and not to
the database as a whole. When a -1018 error is reported, you cannot assume that
the reported page is the only one damaged. Because a backup will stop at the moment
the first -1018 is encountered, you cannot even rely on errors reported during the
backup to show you the full extent of the damage.
You need to know how many pages are damaged in the database as part of deciding
on a recovery strategy. If multiple pages are damaged, multiple errors have likely
occurred, and the platform should be considered in imminent danger of complete failure.
In the majority of -1018 cases investigated by Microsoft IT, there is only a single
damaged page in the database. In this circumstance, absent other indications of
an underlying problem, Microsoft IT will leave the server in service and wait to
implement recovery until a convenient off-peak time. The assumption is that this
is a random error, unless a second error in the near future or similar issues on
other servers indicate a trend.
Note: Remember that an error -1018 prevents an online backup from completing. Delaying
recovery of the database will require you to recover with an increasingly out-of-date
backup. This situation will definitely result in longer downtime during recovery,
because of additional transaction log files that must be replayed. In Exchange Server
2003 SP1, the typical performance of log file replay is better than 1,000 log files
per hour with that performance remaining consistent, regardless of the number of
log files that must be replayed. In prior versions of Exchange, transaction log
file replay can be more than five times slower, with the average speed of replay
tending to diminish as more logs are replayed.
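The cost of delay can be estimated with simple arithmetic. The following Python sketch assumes the steady replay rate of 1,000 log files per hour cited above for Exchange Server 2003 SP1. Because that figure is described as a floor ("better than"), the result is a rough upper bound on replay time. The function name and the example log generation rate are hypothetical.

```python
def replay_hours(logs_to_replay, logs_per_hour=1000):
    """Estimate transaction log replay time in hours at a steady rate."""
    if logs_per_hour <= 0:
        raise ValueError("logs_per_hour must be positive")
    return logs_to_replay / logs_per_hour


# Hypothetical example: a storage group that generates 2,000 log files
# per day, restored from a backup that is three days old.
estimated = replay_hours(2000 * 3)
```

In this example the estimate is six hours of replay. The calculation does not apply to earlier versions of Exchange, where replay speed tends to diminish as more logs are replayed.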
Comprehensively testing an entire database for -1018 pages requires taking the database
offline and running Eseutil in checksum mode.
If you bring a database down after a -1018 error has occurred, there is some chance
that it will not start again. If other, unknown pages have also been damaged, one
of them could be critical to the startup of the database. Statistically, this is
a low probability risk, and Microsoft IT does not hesitate to dismount databases
that have displayed run-time -1018 errors.
Eseutil is installed in the \exchsrvr\bin folder of every Exchange server. When
run in checksum mode (with the /K switch), Eseutil rapidly scans every page in the
database and verifies whether the page is in the right place and whether its checksum
is correct. Eseutil runs as rapidly as possible without regard to other applications
running on the server. Running Eseutil /K on a database drive shared with other
databases is likely to adversely affect the performance of other running databases.
Therefore, you should schedule testing of a database for off-peak hours whenever
possible.
Note: If you decide to copy Exchange databases to different hardware to safeguard
them, be sure that you copy them rather than move them. The problems on the current
platform may not be in the disk system, but may cause corruption to occur during
the move process. If you move the files, you get no second chance if this corruption
happens.
At Microsoft, Eseutil checksum verification is done by running multiple copies of
Eseutil simultaneously against the database. One instance of Eseutil /K is started
against the database, and after a minute, another instance is started against the
same database. The reason for doing this is that in a mirror set, one side of the
mirror may have a bad page, but the other side may not.
Running two copies of Eseutil slightly out of synch with each other makes it much
more likely that both sides of a mirror will be read. It is not often that one side
of a mirror is good and one side is bad, but it does happen, and a thorough test
requires testing both sides of the mirror. At Microsoft, this Eseutil regimen is
also run five times in succession, to further increase the confidence level in the
results.
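The regimen described above can be written down as a run plan. The following Python sketch only builds the plan. It does not start any process, and the function name `eseutil_checksum_plan` is hypothetical. The default values (two instances, a 60-second stagger, five rounds) come from the description above, and the only Eseutil switch used is /K.

```python
def eseutil_checksum_plan(database_path, rounds=5, instances=2, stagger_seconds=60):
    """Return (round, start_offset_seconds, argv) for each Eseutil /K run.

    Offsets are relative to the start of each round. Rounds are meant to
    run one after another, each starting when the previous one finishes.
    """
    plan = []
    for round_number in range(1, rounds + 1):
        for instance in range(instances):
            argv = ["eseutil", "/K", database_path]
            plan.append((round_number, instance * stagger_seconds, argv))
    return plan
```

With the defaults, the plan contains ten runs: in each of five rounds, one instance starts immediately and a second starts one minute later against the same database.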
Note: Multiple runs of Eseutil /K are unnecessary if databases are stored on a RAID-5
stripe set, where data is striped with parity across multiple disks. This is because
there is only one copy of a particular page in the set, with redundancy being achieved
by the ability to rebuild the contents of a lost drive from the parity. Also, note
that as a general rule, RAID 1 (Mirroring) or RAID 1+0 (Mirroring+Striping) drive
sets are recommended for heavily loaded database drives for performance reasons.
Recovering from a -1018 Error
Microsoft IT undertakes two fundamental tasks to recover from a -1018 error:
These tasks are not completely independent of each other. What is discovered about
the root cause may influence the data recovery strategy.
For example, if there are overt signs that server hardware is in imminent danger
of complete failure, the data recovery strategy may require immediate data migration
to a different server. If the server appears to be otherwise stable, data recovery
may consist merely of restoring from backup, to remove the bad page from the database.
Server Recovery
At Microsoft, a single -1018 error puts a server on a watch list. It does not trigger
replacement or retirement of the hardware unless there has been positive identification
of the component that caused the error. If additional -1018 errors occur on the
same server in the near future, regardless of whether the root cause has been specifically
determined, the server is treated as untrustworthy. It is taken out of production
and extensive testing is done.
It may seem obvious that after any -1018 error occurs, you should immediately take
the server down and run a complete suite of manufacturer diagnostics. Yet this is
not something that Microsoft IT does as a matter of course. The reason is that standard
diagnostic tests are seldom successful in uncovering the root cause of a -1018 error.
This is because:
-
The corruption may be an anomaly. Power fluctuations and interference, temporary
rises in heat or humidity, and even cosmic rays can corrupt computer data. Unless
these conditions are repeated at the time the test is run, the test will show nothing.
-
If a -1018 error occurs only once and is not accompanied by any other visible errors
or issues, it is probable that the server is currently functioning normally.
The condition that caused the problem may occur infrequently or require a particular
confluence of circumstances that cannot be replicated by a general diagnostic tool.
-
Hardware frequently fails in an intermittent rather than steady or progressive pattern.
-
The problem may be the result of a subtle hardware or firmware bug rather than due
to a progressively failing component. In this case, ordinary manufacturer
diagnostics may be incapable of uncovering the issue. If these diagnostics could
detect the issue, it would have already been uncovered in a previous diagnostic
run.
-
The problem may be a Heisenbug. The term Heisenbug refers to a problem
that cannot be reproduced because the diagnostic tools used to observe the system
change the system enough that the problem no longer occurs. For example, a tool
that monitors the contents of RAM may slow down processing enough that timing tolerances
are no longer exceeded, and the problem disappears.
-
The diagnostic tool may not be able to simulate a load against the server that is
sufficiently complex. There is a misconception that -1018 errors are more
likely to appear when you place a system under a heavy I/O load. The experience
at Microsoft is that the complexity of the load is more relevant to exposing a data
corruption issue than is the overall level of load. Complexity can be in the type
of access (the I/O size combined with direction), as well as in the actual data
content patterns. Certain complex patterns can show noise or crosstalk problems
that will not be exposed by simpler patterns. One of the strengths of the Finisar
Medusa Labs Test Tool Suite is its ability to generate such patterns.
Manufacturer diagnostics are typically run only after the server has already been
taken out of production. This happens after a pattern of -1018 errors has established
that an underlying problem exists, but the root cause has not yet been discovered.
Along with these diagnostics, Microsoft IT also tries to reproduce data corruption
problems by using tools that stress the disk subsystem.
The Jetstress (Jetstress.exe) and Exchange Server Load Simulator (LoadSim) tools
can be used to realistically simulate the I/O load demands of an actual Exchange
server. The primary function of these tools is for capacity planning and validation,
but they are also useful for testing hardware capabilities.
Jetstress creates several Exchange databases and then exercises the databases with
realistic Exchange database I/O requests. This approach lets you determine whether
the I/O bandwidth of the disk system is sufficient for its intended use.
LoadSim simulates Messaging application programming interface (MAPI) client (Microsoft
Office Outlook 2003) activity against an Exchange server and is useful for judging
the overall performance of the server and network. LoadSim requires additional client
workstation computers to present high levels of client load to the server.
While neither tool is intended as a disk diagnostic tool, both can be used to create
large amounts of realistic Exchange disk I/O. For this purpose, most people prefer
Jetstress because it is simpler to set up and tune. Both Jetstress and LoadSim come
with extensive documentation and setup guidance and are available free for download
from Microsoft. You can download Jetstress from the Microsoft Download Center. You can
download LoadSim from the Microsoft Windows Server System Web site.
Microsoft IT also uses the Medusa Labs Test Tools Suite from Finisar for advanced
stress testing of disk systems. The Finisar tools can generate complex and specific
I/O patterns, and are designed for testing the reliability of enterprise-class systems
and storage. While Jetstress and LoadSim are capable of generating realistic Exchange
server loads, the Finisar tools generate more complex and demanding I/O patterns
that can uncover subtle data and signal integrity issues.
For detailed information about the Medusa Labs Test Tools Suite, see the
Finisar Web site.
Use of Jetstress, LoadSim, or the Medusa tools requires that the server be taken
out of production service. Each of these tools, used in a stress test configuration,
makes the server unusable for other purposes while the tests are running.
The Eseutil checksum function is also sometimes useful in reproducing unreliability
in the disk system. Eseutil scans through a database as quickly as possible, reading
each page and calculating the checksum that should be on it. It will use all the
disk I/O bandwidth available. This puts significant I/O load on the server, although
not a particularly complex load. If successive Eseutil runs report different damage
to pages, this indicates unreliability in the disk system. This is a simple test
to uncover relatively obvious problems. A disk system that fails this test should
not be relied on to host Exchange data in production. However, the Eseutil checksum
function is unlikely to reveal subtle problems in the system.
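Comparing successive runs is a matter of set arithmetic. The following Python sketch assumes that the damaged page numbers reported by each Eseutil pass have already been collected by hand or by a script. Parsing Eseutil output is not shown, and the function name `compare_checksum_runs` is hypothetical.

```python
def compare_checksum_runs(runs):
    """Separate pages reported in every run from pages reported in only some.

    runs is a list of sets, one per Eseutil /K pass, each holding the
    page numbers that the pass reported as damaged.
    """
    every_run = set.intersection(*runs) if runs else set()
    any_run = set.union(*runs) if runs else set()
    return {
        "consistent": sorted(every_run),
        "intermittent": sorted(any_run - every_run),
    }
```

A page that appears under "intermittent" was read as damaged in one pass and as sound in another, which points to unreliable reads rather than to stable damage in the file.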
Another test that is frequently done is to copy a large checksum-verified file (such
as an Exchange database) from one disk to another. If the file copy fails with errors,
or the copied file is not identical to the source, this is a strong indication of
serious disk-related problems on the server.
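The copy test can be scripted as follows. This Python sketch compares SHA-256 digests of the source and the copy. The function names are hypothetical, and the choice of SHA-256 is an assumption, since any strong digest or a byte-by-byte comparison would serve. Note that the operating system may satisfy the second read from its file cache rather than from disk, so the test is most meaningful with files considerably larger than the memory of the server.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_compare(source, destination):
    """Copy source to destination and report whether the contents match."""
    shutil.copyfile(source, destination)
    return sha256_of(source) == sha256_of(destination)
```

A result of False, or an exception raised during the copy, corresponds to the failure conditions described above.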
As a final note about server recovery, you should verify that the Exchange server
and disk subsystem are running with the latest firmware and drivers recommended
by the manufacturers. If they are not, it is possible that upgrading will resolve
the underlying problem.
Microsoft works closely with manufacturers when -1018 patterns are correlated with
particular components or configurations, and hardware manufacturers are continually
improving and upgrading their systems. In rare cases, you may discover that -1018
errors begin occurring soon after application of a new driver or upgrade. This is
another case where a standardized hardware platform can make troubleshooting and
recognizing patterns easier.
Data Recovery
The first—if somewhat obvious—question to answer when deciding on a data recovery
strategy is this: Is the database still running?
If the database is running, you know that the error has not damaged parts of the
database critical to its continuing operation. While some user data may have already
been lost, it is likely that the scope of the loss is limited.
The next question is: Do you believe the server is still reliable enough to remain
in production?
At Microsoft, if a single -1018 occurs on a server but there is no other indication
of system instability, the server is deemed healthy enough to remain in production
indefinitely. This conclusion is subject to the appearance of additional errors.
Before deciding on a data recovery strategy, you must assess the urgency with which
the strategy must be executed. Along with the current state of the database, what
you have learned already from the root cause analysis will factor heavily into this
assessment. The following questions must be considered:
-
Has more than one error occurred? If multiple errors have occurred, or additional
errors are occurring during your troubleshooting, you should consider it highly
likely that the entire platform may suddenly fail.
-
Is more than one database involved?
-
Is the platform obviously unstable? For example, suppose that you find during
root cause analysis that you cannot copy large files to the affected disk without
errors during the copy. It becomes much more urgent at this point to move the databases
to a different platform immediately.
-
Is there a recent backup of the affected data? If you have not been monitoring
backup success, backups may have been failing for days or weeks because the database
was already damaged. You are at even greater risk if there is a sudden failure of
the server.
If you do not have a good, recent online backup, you must make it a high priority
to shut down the databases and copy the database files from the server to a safe
location. If you do not have a recent online backup, and if you do not make an offline
backup, you run the risk that subsequent damage to the database will make it irreparable
and result in catastrophic data loss.
While it is true that the database is already damaged, it can be repaired with Eseutil,
as long as the damage does not become too extensive. More detail about repairing
the database is provided later in this document.
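The urgency questions above can be sketched as a small triage helper. This is only an illustration of the reasoning in this section, not a Microsoft IT tool: the `Assessment` fields, the urgency labels, and the choice to treat multiple affected databases as raising urgency are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Answers gathered during root cause analysis (hypothetical field names)."""
    error_count: int            # number of -1018 errors seen so far
    databases_affected: int     # number of databases reporting errors
    platform_unstable: bool     # e.g. large file copies to the disk fail
    recent_good_backup: bool    # a verified, recent online backup exists


def assess_urgency(a):
    """Return a coarse urgency label based on the questions in this section."""
    if a.platform_unstable:
        # The text calls for moving databases to another platform immediately.
        return "immediate"
    if a.error_count > 1 or a.databases_affected > 1:
        # Multiple errors suggest the platform may fail suddenly.
        # Treating multiple databases the same way is an assumption.
        return "high"
    # A single error with no other sign of instability.
    return "routine"


def needs_offline_copy(a):
    """Without a good recent backup, copy the database files off the server first."""
    return not a.recent_good_backup


single_error = Assessment(error_count=1, databases_affected=1,
                          platform_unstable=False, recent_good_backup=True)
print(assess_urgency(single_error), needs_offline_copy(single_error))
```

A helper like this is mainly useful as a checklist: it forces each question to be answered explicitly before a recovery strategy is chosen.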
Microsoft IT chooses from several standard strategies to recover a database after
a -1018 error occurs. The next sections outline the advantages and disadvantages
of each strategy, along with the preconditions required to use the strategy.
Restore from Backup
Restoring from a known good backup and rolling the database forward is the only
strategy guaranteed to result in zero data loss regardless of how many database
pages have been damaged. This strategy requires the availability and integrity of
all transaction logs from the time of backup to the present.
The reason that this strategy results in zero data loss is that after Exchange detects
non-transient -1018 damage on a page, the page is never again used or updated. One
of two conditions applies: either the backup copy of the database already carries
the most current version of the page, or one of the transaction logs after the point
of backup carries the last change made to the page before it was damaged. Thus,
restoring and rolling forward expunges the bad page with no data loss.
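Because rolling forward depends on an unbroken run of transaction logs, it can help to check for gaps before committing to this strategy. The following is a minimal sketch under stated assumptions: it assumes log files are named with a three-character prefix such as E00 followed by a five-digit hexadecimal generation number (for example, E0000001.log), and the function name is hypothetical. It only detects gaps inside the set of files it is given; it does not confirm that the set reaches back to the backup or forward to the present.

```python
import re

# Assumed naming pattern: prefix "E" + two digits, then five hex digits.
# The current log (for example E00.log) has no generation number and is skipped.
LOG_NAME = re.compile(r"^E\d\d([0-9A-Fa-f]{5})\.log$", re.IGNORECASE)


def missing_generations(filenames):
    """Return generation numbers absent between the lowest and highest log found."""
    generations = set()
    for name in filenames:
        match = LOG_NAME.match(name)
        if match:
            generations.add(int(match.group(1), 16))
    if not generations:
        return []
    low, high = min(generations), max(generations)
    return [g for g in range(low, high + 1) if g not in generations]


logs = ["E0000001.log", "E0000002.log", "E0000004.log", "E00.log"]
print(missing_generations(logs))
```

An empty result means the logs examined form a contiguous sequence; any number in the result identifies a missing log that would stop roll forward at that point.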
Note: Before restoring from a backup, you should always make a copy of the current
database. Even if the database is damaged, it may be repairable. If you restore
from a backup, the current database will be overwritten at the beginning of the
restoration process. If restoration fails, and you have a copy of the damaged database,
you can then fall back on repairing the database as your recovery strategy.
Restoration from a backup is the method that Microsoft IT uses most often
to recover from a -1018 error. Each Exchange database at Microsoft is sized so that
it can be restored in about an hour.
Restoration is also much faster than other recovery strategies. Assuming that the
server is deemed stable enough, restoration is scheduled for an off-peak time, and
results in minimal disruption for end users. For more information about how Microsoft
backs up Exchange, refer to the IT Showcase paper
Backup Process Used with Clustered Exchange Server 2003 Servers at Microsoft.
Migrate to a New Database
Exchange System Manager provides the Move Mailbox facility for moving all mailbox
data from one database or server to another. This can be done while the database
is online, and even while users are logged on to their mailboxes. However, most
Exchange administrators prefer to schedule a general outage when moving mailboxes
so that individual users do not experience a short disconnect when each mailbox
is moved.
In Exchange Server 2003, mailbox moves can be scheduled and batched. In conjunction
with Microsoft Office Outlook's Exchange Cached Mode, the interruption in service
when each mailbox is moved often goes unnoticed by end users, who can continue to
work from a cached copy of the mailbox.
For public folder databases, each folder can be migrated to a different server by
replication. If additional replicas of all folders already exist on other servers,
you can migrate all data by removing all replicas from the problem database. This
will trigger a final synchronization of folders from this database to the other
replicas.
After Exchange System Manager shows that replication has finished for all folders
in a public folder database, you may delete the original database files. When you
mount a database again after deleting its files, a new, empty database is generated.
You can then replicate folders from other public folder servers back to this new
database, if desired.
Migrating data to a different database leaves behind any -1018 or -1019 problems
because bad pages will not be used during the move or replication operations. Unlike
using a restore and roll forward strategy, migrating data will not recover the information
that was on the bad page. It will definitely leave the bad data behind.
A particular message, folder, or mailbox may fail to move, and you may notice a
simultaneous -1018 error in the application log. This can allow you to identify
the error and the data affected by it. In Exchange Server 2003, new move mailbox
logging can report details about each message that fails to move, or can skip mailboxes
that show errors during a mass move operation. For more details about configuring,
batching, and logging mailbox move operations, refer to Exchange Server 2003 online
Help.
Sometimes, a single bad page can affect multiple users. This is because of single
instance storage. In an Exchange database, if a copy of a message is sent to multiple
users, only one copy of the message is stored, and all users share a link to it.
Sometimes, the data migration will complete with no errors, even though you know
there are -1018 problems in the database. This will happen if the bad page is in
a database structure such as a secondary index. Such structures are not moved, but
are rebuilt after data is migrated. If the Move Mailbox or replication operations
complete with no errors, this indicates the bad page was in a section of the database
that could be reconstructed, or in a structure such as a secondary index that could
be discarded. In these cases, migrating from the database does result in a zero
data loss recovery.
Moving or replicating all the data in a 50-gigabyte (GB) database can take a day
or two. Therefore, if you choose a migration strategy, you must believe that the
server is stable enough to remain in service long enough to complete the operation.
Repair the Database
The Eseutil and Information Store Integrity Checker (Isinteg.exe) tools are installed
on every Exchange server and administrative workstation. These tools can be used
to delete bad pages from the database and restore logical consistency to the remaining
data.
Repairing a database typically results in some loss of data. Because Exchange treats
a bad page as completely unreadable, nothing that was on the page will be salvaged
by a repair. In some cases, repair may be possible with zero data loss, if the bad
page is in a structure that can be discarded or reconstructed. The majority of pages
in an Exchange database contain user data. Therefore, the chance that a repair will
result in zero data loss is low.
Repair is a multiple stage procedure:
-
Make a copy of the database files in a safe, stable location.
-
Run Eseutil in repair mode (/P command-line switch). This removes bad pages and
restores logical consistency to individual database tables.
-
Run Eseutil in defragmentation mode (/D command-line switch). This rebuilds secondary
indexes and space trees in the database.
-
Run Isinteg in fix mode (-Fix command-line switch). This restores logical consistency
to the database at the application level. For example, if several messages were
lost during repair, Isinteg will adjust folder item counts to reflect this, and
will remove missing message header lines from folders.
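The tool stages above can be sketched as an ordered list of command lines. This example only builds the argument lists and does not run anything. The /P, /D, and -Fix switches come from the steps above; the `-s` and `-test alltests` arguments to Isinteg, the function name, and the sample path and server name are assumptions for illustration, so confirm the exact syntax against the tool help on your server. The first step (copying the database files to a safe location) is deliberately left out and must be done before any of these commands is run.

```python
def repair_commands(edb_path, server_name):
    """Build the repair command sequence described above, in order.

    Returns a list of argument lists. Nothing is executed here.
    """
    return [
        # Stage 2: repair mode removes bad pages and fixes individual tables.
        ["eseutil", "/p", edb_path],
        # Stage 3: defragmentation rebuilds secondary indexes and space trees.
        ["eseutil", "/d", edb_path],
        # Stage 4: application-level consistency fix (arguments beyond -fix
        # are assumed typical usage, not taken from this paper).
        ["isinteg", "-s", server_name, "-fix", "-test", "alltests"],
    ]


# Hypothetical path and server name, for illustration only.
for command in repair_commands(r"D:\exchsrvr\mdbdata\priv1.edb", "EXCH01"):
    print(" ".join(command))
```

Keeping the sequence in one place makes it harder to skip a stage or run the stages out of order during an emergency.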
Typically, repairing a database takes much longer than restoring it from a backup
and rolling it forward. The amount of time required varies depending on the nature
of the damage and the performance of the system. As an estimate, the repair process
often takes about one hour per 10 GB of data. However, it is not uncommon for it
to be several times faster or slower than this estimate.
Repair also requires additional disk space for its operations. You must have space
equivalent to the size of the database files. If this space is not available on
the same drive, you can specify temporary files on other drives or servers, but
doing so will dramatically reduce the speed of repair.
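The planning figures above reduce to simple arithmetic. The sketch below applies the rough estimate of one hour per 10 GB and the requirement for free space equal to the database size. The function name and the `hours_per_10_gb` parameter are assumptions; because actual repairs can be several times faster or slower, the result is a starting point for planning, not a prediction.

```python
def estimate_repair(db_size_gb, hours_per_10_gb=1.0):
    """Rough repair planning figures for a database of the given size.

    hours_per_10_gb defaults to the paper's estimate of about one hour
    per 10 GB; adjust it to reflect your own measurements.
    """
    return {
        "hours": db_size_gb / 10.0 * hours_per_10_gb,
        # Repair needs working space equivalent to the database files.
        "free_space_gb": db_size_gb,
    }


print(estimate_repair(50))
```

For a 50-GB database this gives about five hours of repair time and 50 GB of working space, which is far longer than the roughly one-hour restore window described earlier.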
Because repair is slow and usually results in some data loss, it should be used
as a recovery strategy only when you cannot restore from a backup and roll the database
forward.
There may be cases where you have a good backup, but are unable to roll the database
forward. You can then combine the restoration and repair strategies to recover the
maximal amount of data. This option is explored in more detail in the next section.
The database repair tools have been refined and improved continually since the first
version of Exchange was released, and they are typically effective in restoring
full logical consistency to a database. Despite the effectiveness of repair, Microsoft
IT considers repair an emergency strategy to be used only if restoration is impossible.
Because Microsoft IT is stringent about Exchange backup procedures, repair is almost
never used except as part of the hybrid strategy described in the next section.
After repairing a database, Microsoft IT policy is to migrate all folders or mailboxes
to a new database rather than to run a repaired database indefinitely in production.
Restore, Repair, and Merge Data
There is a hybrid recovery strategy that can be used if you are unable to roll forward
with a restored database because a disaster has destroyed necessary transaction
log files.
In this scenario, an older, but good, copy of the database is restored from a backup.
Because the transaction logs needed for zero loss recovery are unavailable, the
restored database is missing all changes since the backup was taken.
However, the damaged database likely contains the majority of this missing data.
The goal is to merge the contents of the damaged database with the restored database,
thus recovering with minimal data loss.
To do this, the damaged database is moved to an alternate location where it can
be repaired while the restored database is running and servicing users. In Exchange
Server 2003, you can use the recovery storage group feature to do the restoration
and repair on the same server. In previous versions of Exchange, it was necessary
to copy the database to a different server to repair it and merge data.
Bulk merge of data between mailbox databases can be accomplished in two ways:
-
Run the Mailbox Merge Wizard (ExMerge). You can
download ExMerge from the Microsoft Download Center. ExMerge will copy mailbox
contents between databases, suppressing copying of duplicate messages, and allowing
you to filter the data merge based on timestamps, folders, and other criteria. ExMerge
is a powerful and sophisticated tool for extracting and importing mailbox data.
-
Use the Recovery Storage Group Wizard in Exchange System Manager in Exchange
Server 2003 SP1. The Recovery Storage Group Wizard merges mailbox contents
from a database mounted in the recovery storage group to a mounted copy of the original
database. Like ExMerge, the Recovery Storage Group Wizard suppresses duplicates,
but it does not provide other filtering choices. For the majority of data salvage
operations, duplicate suppression is all that is required. In most cases, the Recovery
Storage Group Wizard provides core ExMerge functionality, but is simpler to use.
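The idea of merging with duplicate suppression can be illustrated with a short sketch. This is not how ExMerge or the Recovery Storage Group Wizard is implemented; the message dictionaries, the `id` key used to recognize duplicates, and the function name are hypothetical and exist only to show the concept.

```python
def merge_mailbox(restored, repaired):
    """Merge messages from a repaired copy into a restored mailbox.

    Both arguments are lists of dicts with an 'id' key (a hypothetical
    stand-in for whatever the real tools use to recognize duplicates).
    Messages already present in the restored mailbox are not copied again.
    """
    seen = {message["id"] for message in restored}
    merged = list(restored)
    for message in repaired:
        if message["id"] not in seen:
            merged.append(message)
            seen.add(message["id"])
    return merged


restored = [{"id": "a", "subject": "Before backup"}]
repaired = [{"id": "a", "subject": "Before backup"},
            {"id": "b", "subject": "After backup"}]
print([m["id"] for m in merge_mailbox(restored, repaired)])
```

The restored database supplies everything up to the backup, and the repaired database contributes only what arrived afterward and survived the repair.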
Alternate Server Restoration
Exchange allows restoration of a backup created on one server to a different server.
In this scenario, you create a storage group and database on the destination server,
and restore the backup to it. You can also copy log files from one server to another
to roll the database forward.
This strategy may be necessary if the original server is deteriorating rapidly,
and you must find an alternate location quickly to host the database. You can restore
either an online backup or offline copies of the databases to the alternate server.
After the database has been restored, you must redirect Active Directory® directory
service accounts to the mailboxes now homed on the new server. This can be done
in either of the following ways:
-
In Exchange Server 2003, use the Remove Exchange Attributes task for all users with
mailboxes in the database, followed by using the Mailbox Recovery Center to automatically
reconnect all users to the mailboxes on the new server.
-
Use a script for the Active Directory attribute changes to redirect Active Directory
accounts to the new server.
This is an advanced strategy. You may want to consult with Microsoft Product Support
Services if it becomes necessary to use it, and you have not successfully accomplished
it in the past. This strategy may also require the editing or re-creation of client
Outlook profiles.
Best Practices
Microsoft IT manages approximately 95 Exchange mailbox servers that host 100,000
mailboxes worldwide. In the last year, there have been six occurrences of error
-1018 across all these servers, with the errors limited to two servers.
One server had four errors and another had two errors. In the first case, the root
cause was traced to a specific hardware failure. The second server is still under
investigation because the two errors occurred very close together in time, but have
not occurred since.
Microsoft IT has seen a general trend of decreasing numbers of -1018 errors year
over year. This corresponds with the experience of many Exchange administrators
who see fewer -1018 errors in Exchange today than in years past. Administrators
often assume that the decrease in these errors must be due to improvements in Exchange.
However, the credit really belongs to hardware vendors who are continually increasing
the reliability and scalability of their products. Microsoft's primary contribution
has been to point out problems that the vendors have then solved.
Along with using reliable enterprise-class hardware for your Exchange system, there
are several best practices used by Microsoft IT that you can implement to reduce
even further the likelihood of encountering data file corruption.
Hardware Configuration and Maintenance
Follow these best practices:
-
Disable hardware write back caching on disk drives that contain Exchange data, or
ensure you have a reliable controller that can maintain its cache state if power
is interrupted.
It is important to distinguish here between caching on a disk drive and caching
on a disk controller. You should always disable write back caching on a disk drive
that hosts Exchange data, but you may enable it on the disk controller if the controller
can preserve the cache contents through a power outage.
When Exchange writes to its transaction logs and database files, it orders the operating
system to flush those writes to disk immediately. Nearly all modern disk controllers
report to the operating system that writes have been flushed to a disk before they
actually have. This means that disks and controllers must ensure that writes have
succeeded in case there is a power outage. There is nothing an application can do
to reliably override disk system behavior and actually force writes to be secured
to a disk.
-
Change cache batteries in disk controllers, uninterruptible power supplies (UPSs),
and other power interruption recovery systems as manufacturers recommend. A failed
battery is a common reason for data corruption after a power failure.
-
Test systems before putting them in production. Microsoft IT uses Jetstress for
burn-in testing of new Exchange systems. The Medusa Labs Test Tool Suite from Finisar
is normally used in Microsoft IT only for advanced forensic analysis after less
sophisticated tools have not been able to reproduce a problem.
-
Test the actual drive rebuild and hot swap capabilities of your disk system for
both performance and data integrity reasons. It is possible that the performance
of a system will be so greatly impacted during a drive rebuild operation that it
becomes unusable. There have also been cases where the drive rebuild functionality
has become unstable when disks have remained under heavy load during a drive rebuild
operation.
-
Power down server and disk systems in the order and by the methods recommended by
manufacturers. You should know the expected shutdown times for your systems, and
at which points a hard shutdown is safe or risky. Many server systems take much
longer to shut down than consumer computer systems. The experience of Microsoft
Product Support Services is that impatience during shutdown is an all too common
cause of data corruption.
-
Standardize the hardware platform used for Exchange. Not only does this improve
general server manageability, but it also makes troubleshooting and analysis of
errors across servers easier.
-
Stay current on firmware upgrades for servers, disk controllers, and switches,
and on upgrades for software that manages disks and disk I/O.
-
Verify with your vendor that the disk controllers used with Exchange support atomic
I/O, and find out the atomicity value.
To support atomic I/O is to support writing all of the data that an application
requests in a single I/O or to write none of it. For example, if an application
sends a 64-KB write to a disk, and a hard failure occurs during the write, the result
should be that none of the write is preserved on a disk. Atomicity involves all
or nothing.
Without atomic I/O, you are vulnerable to torn pages where a chunk of disk may be
composed of a mixture of old and new data. In the 64 KB example, it may be that
the first 32 KB is new data and the last 32 KB is old data. In Exchange, a torn
4-KB write to the database will certainly result in a -1018 error.
The atomicity value refers to the largest single write that the controller guarantees
to write on an all or nothing basis. For example, this might be 128 KB: for any
I/O request of 128 KB or less, the write will happen atomically, or, in effect, all
at once with no possibility of a partial write. However, for write requests greater
than 128 KB, there may be no such guarantee.
Exchange issues database write commands in 4 KB or smaller chunks. Therefore, on
a drive hosting only Exchange databases, a write atomicity of 4 KB is required.
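The effect of a torn write can be simulated in a few lines. This is a simplified sketch, not the Exchange page format or checksum algorithm: it assumes a 4-KB page whose first four bytes hold a checksum of the rest, and it uses CRC-32 from the Python standard library as a stand-in checksum. All function names are hypothetical.

```python
import zlib

PAGE_SIZE = 4096          # 4-KB page, as in the discussion above
CHECKSUM_BYTES = 4        # simplified layout: checksum stored at the page start


def make_page(payload):
    """Build a page: 4-byte little-endian checksum followed by padded payload."""
    body_size = PAGE_SIZE - CHECKSUM_BYTES
    body = payload[:body_size].ljust(body_size, b"\0")
    checksum = zlib.crc32(body) & 0xFFFFFFFF
    return checksum.to_bytes(CHECKSUM_BYTES, "little") + body


def verify_page(page):
    """Recompute the checksum and compare it with the stored value."""
    stored = int.from_bytes(page[:CHECKSUM_BYTES], "little")
    return stored == (zlib.crc32(page[CHECKSUM_BYTES:]) & 0xFFFFFFFF)


def torn_write(old_page, new_page, bytes_written):
    """Simulate a failure after only the first bytes_written bytes reached disk."""
    return new_page[:bytes_written] + old_page[bytes_written:]


old_page = make_page(b"old" * 1000)
new_page = make_page(b"new" * 1000)
torn_page = torn_write(old_page, new_page, PAGE_SIZE // 2)
print(verify_page(old_page), verify_page(new_page), verify_page(torn_page))
```

The torn page carries the new checksum but a mixture of new and old data, so verification fails on the next read. This is the same class of mismatch that Exchange reports as a -1018 error.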
Operations
Follow these best practices:
-
Place Exchange databases and transaction log files in separate disk groups. As a
rule, Exchange log files should never be placed on the same physical drives as Exchange
database files. There are two important reasons for this:
-
Fault tolerance. If the disks hosting Exchange database files also hold the transaction
logs, loss of these disks will result in loss of both database and transaction log
files. This will make rolling the database forward from a backup impossible.
-
Performance. The disk I/O characteristics for an Exchange database are a high amount
of random 4-KB reads and writes, typically with twice as many reads as writes. For
Exchange transaction log files, I/O is sequential and consists only of writes. As
a rule, mixing sequential and random I/O streams to the same disk results in significant
performance degradation.
-
Track all Exchange data corruption issues across all Exchange servers. This provides
you data for trend analysis and troubleshooting of subtle platform flaws. For more
information, see "Appendix B: -1018 Record Keeping" later in this document.
-
Preserve Windows event logs. It is all too common for event logs generated during
the bookend period to be cleared or automatically overwritten. (For details, see
"Bookending" earlier in this document.) The event logs are important for root cause
analysis. If you are running Exchange in a cluster, ensure that event log replication
is configured, or that you gather and preserve the event logs from every node in
the cluster, whether actively running Exchange or not.
Conclusion
For most organizations, huge amounts of important data are managed in Microsoft
Exchange database files. Current server class computer hardware is very reliable
but it is not perfect. Because Exchange data files compose many gigabytes or even
terabytes of storage, it is inevitable that the database files will occasionally
be damaged by storage failures.
While no administrator welcomes the appearance of a -1018 error, the error prevents
data corruption from going undetected, and often provides you with an early warning
before problems become serious enough that a catastrophic failure occurs.
Every -1018 error should be logged (as described in Appendix B). Moreover, every
-1018 requires some kind of recovery strategy to restore data integrity (as described
above in "Recovering from a -1018 Error"). However, not every -1018 error indicates
failing or defective hardware.
At Microsoft, a rate of one error -1018 per 100 Exchange servers per year is considered
normal and to be expected. This "1 in 100" acceptable error rate is based on Microsoft's
experience with the limits of hardware reliability.
Microsoft IT will replace hardware or undertake a root cause investigation if any
of the following conditions exist:
-
The -1018 error is associated with other errors or symptoms that indicate failures
or defects in the system.
-
More than one -1018 error has occurred on the same system.
-
-1018 errors begin occurring above the "1 in 100" threshold on multiple systems of
the same type.
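The conditions above can be checked against your own error log with a short calculation. The sketch below is an illustration only: the function names and server names are hypothetical, and the sample figures reuse the counts given in "Best Practices" (six errors across approximately 95 servers, four on one server and two on another).

```python
def errors_per_100_servers(error_count, server_count, years=1.0):
    """Normalize an error count to errors per 100 servers per year."""
    return error_count / server_count / years * 100.0


def servers_with_repeat_errors(errors_by_server):
    """Return the servers that have logged more than one -1018 error."""
    return sorted(name for name, count in errors_by_server.items() if count > 1)


# Hypothetical server names; counts taken from the "Best Practices" section.
errors_by_server = {"SERVER-A": 4, "SERVER-B": 2}
total_errors = sum(errors_by_server.values())

print(round(errors_per_100_servers(total_errors, 95), 1))
print(servers_with_repeat_errors(errors_by_server))
```

In this sample the overall rate is above one per 100 servers, but every error falls on a server with repeat errors, so the second condition in the list already calls for investigation of those two systems rather than of the platform as a whole.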
While there may be nothing you can do about the fact that -1018 errors occur, you
can reduce the incidence of errors. If you are experiencing -1018 errors at a rate
greater than one or two a year per 100 Exchange servers, the root cause analysis
advice and practices outlined in this paper can be of practical benefit to you.
Even if you are not experiencing excessive rates of this problem, we hope that the
recovery methods suggested in this paper will help you recover more quickly and
effectively.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales
Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information
Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact
your local Microsoft subsidiary. To access information through the World Wide Web,
go to:
http://www.microsoft.com
http://www.microsoft.com/itshowcase
http://www.microsoft.com/technet/itshowcase
For any questions, comments, or suggestions on this document, or to obtain additional
information about How Microsoft Does IT, please send e-mail to:
showcase@microsoft.com
Appendix A: Case Studies
This section outlines two case studies of actual -1018 investigations, conducted
jointly by Microsoft, third-party vendors, and Exchange customers. For privacy reasons,
the names of the customers and vendors are omitted, and identifying details may
have been changed.
These investigations are not typical of what is required to identify the root cause
for the majority of -1018 errors. Rather, they illustrate the more subtle and difficult
cases that are sometimes encountered. In both cases, trending -1018 errors across
a common platform was critical to the investigation.
Case Study 1
An Exchange customer with nearly 100 Exchange servers in production was experiencing
occasional but recurring -1018 errors on a minority of the servers. All servers
used for Exchange were from the same manufacturer, with two different models used
depending on the role and load of the server. Errors occurred, seemingly at random,
in both server models.
Ordinary diagnostics showed nothing wrong with any of the servers. If a -1018 error
occurred on a server, another error might not occur for several months. Microsoft
personnel recommended taking some of the servers out of production and running extended
Jetstress tests. These tests also revealed nothing. Although all the servers were
similar to each other, only a minority of the servers (about 20 percent) ever
experienced -1018 problems. Still, this was far above a reasonable threshold for
random errors, and so the server platform was considered suspect.
Microsoft personnel recommended tracking each -1018 error that happened across all
servers in a single spreadsheet. (For details, see "Appendix B: -1018 Record Keeping"
later in this document.) This technique would allow confirmation of subjective
impressions and allow better analysis of subtle patterns that might have been overlooked.
Over time, 17 errors were logged in the spreadsheet and a pattern did emerge. For
most of the -1018 errors, the twenty-eighth bit of the checksum was wrong. If it
was not the twenty-eighth bit, it was the twenty-third or the thirty-second bit.
One of the characteristics of an Exchange checksum is that if an error introduced
on a page is a single bit error (a bit flip), the checksum on the page will also
differ from the checksum that should be on the page by only a single bit.
For example, suppose a -1018 error is reported with these characteristics:
Checksums are stored in little endian format on an Exchange page. The actual checksum
on the page is therefore derived by reversing the order of the four bytes that make
up the eight-digit checksum:
To determine whether two checksums match each other except for a single bit, you
must convert them to binary and then use the XOR logical operator. An XOR
operation compares each bit of one checksum to the corresponding bit of the other.
If the bits are the same (both 0 or both 1), the XOR result is 0. If the
bits are different, the XOR result is 1. Therefore, a single bit difference
between two numbers will result in an XOR result with exactly a single 1
in it. If more than a single bit was changed on a page, the XOR checksum
results will be off by more than a single bit. An illustration of this is shown
in Table 1.
| Checksums | Hexadecimal | Binary |
| Expected checksum | 51 79 f5 33 | 00110011 11110101 01111001 01010001 |
| Actual checksum | 41 79 f5 33 | 00110011 11110101 01111001 01000001 |
| XOR result | 10 00 00 00 | 00000000 00000000 00000000 00010000 |
Table 1. Checksum XOR Analysis
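The analysis in Table 1 can be reproduced with a few lines of code, which is convenient when many errors are being logged. This is a minimal sketch with hypothetical function names. It treats the Hexadecimal column as the byte order and the Binary column as the resulting little endian 32-bit value, which is how the two columns relate numerically, and it numbers bits from the left of the Binary column, starting at 1. That numbering is an assumption chosen because it matches the "twenty-eighth bit" wording used in this case study.

```python
def checksum_value(hex_bytes):
    """Convert four checksum bytes (for example '51 79 f5 33') to a 32-bit value.

    The bytes are read as little endian, so the byte order is reversed.
    """
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")


def differing_bits(expected, actual, width=32):
    """Return 1-based positions, counted from the left, where the values differ."""
    xor_result = expected ^ actual
    bits = format(xor_result, "0{}b".format(width))
    return [index + 1 for index, bit in enumerate(bits) if bit == "1"]


expected = checksum_value("51 79 f5 33")
actual = checksum_value("41 79 f5 33")

print(format(expected, "032b"))
print(format(actual, "032b"))
print(differing_bits(expected, actual))
```

A result containing exactly one position indicates a single bit difference between the two checksums. Recording that position for each error makes patterns such as a recurring twenty-eighth bit easy to spot in a spreadsheet.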
Patterns in -1018 corruptions are often a valuable clue for hardware vendors in
identifying an elusive problem. Along with logging the checksum discrepancies, it
is also useful to dump the actual damaged page for direct analysis. (For details,
see "Appendix B: -1018 Record Keeping" later in this document.)
A server was finally discovered where the problem happened more than once within
a short time frame. Jetstress tests were able to consistently create new -1018 errors,
almost always manifesting as a change in the twenty-eighth bit of the checksum.
The server was shipped to Microsoft for analysis. The errors could not be reproduced
despite weeks of stress testing and diagnostics performed by both Microsoft and
the manufacturer.
In the meantime, the customer noticed that -1018 errors had begun to occur on Active
Directory domain controllers as well as on Exchange servers. The Active Directory
database is based on the same engine as the Exchange database, and it also detects
and reports -1018 errors.
It was noticed that the errors seemed to occur on the Active Directory servers after
restarting the servers remotely with a hardware reset card. Investigators at Microsoft
tried restarting the test server in the same way and were eventually successful
in reproducing the problem.
At this point, it might seem that the reset card was the most likely suspect. However,
the error did not occur every time after a restart with the card. Most of the time,
there was no issue. Long Jetstress runs could be done sometimes with no errors,
and then suddenly all Jetstress runs would fail serially.
Eventually, it became apparent that the problem could be reproduced almost every
seventh restart with the card. It was not the fault of the card, but the fact that
the card performed a complete cold restart of the server, simulating a power reset.
After every seventh cold restart, the server would become unstable. This state would
last through warm restarts until the next cold restart, at which time the server
would be stable again until after another six cold restarts.
Both server models in production in the customer's organization used the exact same
server component with the same part number. However, only 20 percent of the components
were manufactured with this problem, which made it much harder to narrow the cause
down to the faulty component.
Case Study 2
A major Exchange customer with 250 Exchange servers was plagued with frequent -1018
errors on multiple servers and multiple SAN disk systems. Rarely did a week go by
without a full-scale -1018 recovery.
There had been significant data loss multiple times after -1018 errors had occurred.
In one case, there was no backup monitoring being done. The most recent Exchange
backup had actually been overwritten, with no subsequent backups succeeding. After
a month, there was a catastrophic failure on the server, and the database was not
salvageable. All user mail was lost. In another case, the first -1018 error corrupted
several hundred thousand pages in the database, and the transaction log drives were
also affected. Backups had also been neglected on this server as the problem worsened.
The most recent backup was several weeks old, and thus all mail since then was lost.
Microsoft Product Support Services had been called multiple times over the last
several months and had been mostly successful in recovering data after each problem.
However, each of these cases involved individual server operators and Product Support
Services engineers, working in isolation on recovery, but not focusing on root cause
analysis across all servers.
The data loss cases got the attention of both the Microsoft account team and the
Exchange customer's executive management. As Microsoft began correlating cases and
asking for more information about the prevalence of the issue, it became clear quickly
that the rate of -1018 occurrences was far above the standard threshold.
Information about past issues was mostly unavailable or incomplete. However, Microsoft
created a spreadsheet to track each new problem. The spreadsheet started to fill
quickly, and patterns began to emerge. The problem was that there was no single
pattern, but multiple patterns.
In several cases, the lowest two bytes of the checksum were changed. This seemed
promising, but then came several errors where bits 29 and 30 were wrong, with nothing
else in common. Then there was an outbreak of errors where there were large-scale
checksum differences with no discernible pattern in the checksums or the damaged
pages. On some servers, there were multiple bad pages. There were frequent transient
-1018 errors, and frequently a checksum on a full database would reveal different
errors on successive runs.
The investigation and resolution lasted almost a year. As time went on, it became
clear that some servers and disk frames were much more problematic than others,
and that this was not just a general problem with all the Exchange servers across
the organization. During that year, the following problems were discovered to be
root causes of -1018 errors:
-
Server operators were hard cycling servers with disk controllers that had no I/O
atomicity guarantees.
-
SANs without logical unit number (LUN) masking allowed multiple servers to control
a single disk simultaneously, and thus corrupt it.
-
Badly out-of-date firmware revisions were in use, including versions known to cause
data corruption.
-
Cluster systems had not passed Windows Hardware Quality Labs (WHQL) certification.
These clusters had disk controllers that were unable to handle in-flight disk I/Os
during cluster state transitions.
-
Antivirus applications were not configured correctly to exclude Exchange data files.
This was causing sudden quarantine, deletion, or alteration of Exchange files and
processes. Generic file scanning antivirus programs should never be used on Exchange
databases. Many vendors have effective Exchange-aware scanners that implement the
Microsoft Exchange antivirus APIs.
-
A vendor hardware bug accounted for a minority of the errors.
-
Aging and progressively failing hardware, which had exceeded its lifecycle, caused
obvious problems.
Correcting the -1018 root causes was an arduous but ultimately worthwhile process.
It required not only changes to hardware and configurations, but also operational
improvements. The organization dramatically reduced the incidence of -1018 errors
and, by implementing effective monitoring and recovery procedures, greatly decreased
the impact of each error on end users.
This case study contrasts sharply with Case Study 1. In Case Study 1, a mysterious
and subtle hardware bug was the single root cause for all the failures. However,
for most Exchange administrators, the key to reducing and controlling -1018 errors
will be implementing ordinary operational improvements. Most of the time, the patterns
revealed by keeping track of -1018 errors across your organization will point to
obvious errors and problems that should be defended against. Case Study 1, although
perhaps more interesting, was atypical. Case Study 2 is representative of the
process that several Exchange organizations have gone through to control and
reduce -1018 errors.
Appendix B: -1018 Record Keeping
For the majority of -1018 errors, the root cause will be indicated by another correlated
error or failure. For errors where the cause is not so obvious, tracking -1018 errors
across time and across servers is critical for identifying the root cause.
Even for errors where the root cause is easily determined, there is still value
in consistently tracking -1018 errors. You can learn how the errors affect your
organization, and where operational and other improvements could reduce the impact
of the errors.
You may want to track errors in a database, in a spreadsheet, or using a simple
text file. At Microsoft, Microsoft Office Excel 2003 spreadsheets are used. The
following list of fields can be adapted to your needs and your willingness to track
detailed information.
Essentials
These files should always be saved for each -1018 error:
Eseutil Page Dump
This Eseutil facility will show you the contents of important header fields on the
page. This command requires the logical page number. You can calculate the logical
page number from the error description as described in "Page Ordering" earlier in
this document.
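The offset arithmetic is easy to get wrong by hand, so it can help to script it. The following Python sketch assumes the Exchange Server 2003 layout: 4-KB pages, with two header pages at the start of the file before logical page 1. The function names are illustrative only, and you should confirm the page size and header layout against "Page Ordering" before relying on the result.

```python
PAGE_SIZE = 4096     # assumption: 4-KB pages, as in Exchange Server 2003
HEADER_PAGES = 2     # assumption: two header pages precede logical page 1


def logical_page_from_offset(byte_offset):
    """Return the logical page number that contains the given byte offset."""
    if byte_offset < HEADER_PAGES * PAGE_SIZE:
        raise ValueError("offset falls inside the database header pages")
    # Physical page index (0-based) minus one gives the logical page number.
    return byte_offset // PAGE_SIZE - 1


def offset_from_logical_page(page_number):
    """Return the byte offset at which the given logical page starts."""
    if page_number < 1:
        raise ValueError("logical page numbers start at 1")
    return (page_number + 1) * PAGE_SIZE


if __name__ == "__main__":
    offset = offset_from_logical_page(578)
    print("Logical page 578 starts at byte offset", offset)
    print("Offset", offset, "is in logical page", logical_page_from_offset(offset))
```

Converting in both directions gives a quick cross-check: the offset reported in the event description should map to a page number that maps back to an offset no greater than the reported one.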
If, for example, logical page 578 is damaged in the database file Priv1.edb, you
can dump the page to the file 578.txt with this command:
Eseutil.exe /M priv1.edb /P578 > 578.txt
Note that there is no space between the /P switch and the page number.
The output of this command might look similar to this:
Microsoft(R) Exchange Server Database Utilities
Version 6.5
Copyright (C) Microsoft Corporation. All Rights Reserved.
Initiating FILE DUMP mode...
Database: priv1.edb
Page: 578
checksum <0x03300000, 8>: 2484937984258 (0x0000024291d88902)
expected checksum = 0x0000024291d88902
****** checksum mismatch ******
actual checksum = 0x00de00de91d889fd
new checksum format
expected ECC checksum = 0x00000242
actual ECC checksum = 0x00de00de
expected XOR checksum = 0x91d88902
actual XOR checksum = 0x91d889fd
checksum error is NOT correctable
dbtimeDirtied <0x03300008, 8>: 12701 (0x000000000000319d)
pgnoPrev <0x03300010, 4>: 577 (0x00000241)
pgnoNext <0x03300014, 4>: 579 (0x00000243)
objidFDP <0x03300018, 4>: 114 (0x00000072)
cbFree <0x0330001C, 2>: 6 (0x0006)
cbUncommittedFree <0x0330001E, 2>: 0 (0x0000)
ibMicFree <0x03300020, 2>: 4038 (0x0fc6)
itagMicFree <0x03300022, 2>: 3 (0x0003)
fFlags <0x03300024, 4>: 10370 (0x00002882)
Leaf page
Primary page
Long Value page
New record format
New checksum format
TAG 0 cb:0x0000 ib:0x0000 offset:0x0028-0x0027 flags:0x0000
TAG 1 cb:0x000e ib:0x0000 offset:0x0028-0x0035 flags:0x0001 (v)
TAG 2 cb:0x0fb8 ib:0x000e offset:0x0036-0x0fed flags:0x0001 (v)
If you do not see a checksum mismatch in the dump, that does not necessarily mean
that the -1018 error is transient. It is possible that a mistake was made in calculating
the logical page number. It is a good idea to double-check your arithmetic, and
to dump the preceding and next pages as well if you do not find a -1018 error on
the dumped page. Running Eseutil /K against the entire database will also provide
an additional check.
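If you save many page dumps, a small script can pull the checksum values out of each file for your records. The Python sketch below assumes that the dump uses the same line layout as the sample output above; other Eseutil versions may format these lines differently. The function names are illustrative and are not part of any Exchange tool.

```python
import re

# Matches lines such as "expected XOR checksum = 0x91d88902".
# Assumption: the dump follows the layout of the sample output shown above.
CHECKSUM_LINE = re.compile(
    r"(expected|actual)\s+(?:(ECC|XOR)\s+)?checksum\s*=\s*0x([0-9a-fA-F]+)"
)


def parse_checksums(dump_text):
    """Return {(kind, part): value}, e.g. ("expected", "XOR") -> 0x91d88902."""
    values = {}
    for kind, part, hex_digits in CHECKSUM_LINE.findall(dump_text):
        # Lines without ECC/XOR carry the full checksum value.
        values[(kind, part or "full")] = int(hex_digits, 16)
    return values


def summarize_dump(dump_text):
    """Report whether a mismatch is flagged and which XOR checksum bits differ."""
    values = parse_checksums(dump_text)
    expected = values.get(("expected", "XOR"))
    actual = values.get(("actual", "XOR"))
    difference = None
    if expected is not None and actual is not None:
        difference = expected ^ actual
    return {
        "mismatch": "checksum mismatch" in dump_text,
        "xor_difference": difference,
    }


# Excerpt from the sample output above.
SAMPLE_DUMP = """\
expected checksum = 0x0000024291d88902
****** checksum mismatch ******
actual checksum = 0x00de00de91d889fd
expected ECC checksum = 0x00000242
actual ECC checksum = 0x00de00de
expected XOR checksum = 0x91d88902
actual XOR checksum = 0x91d889fd
"""

if __name__ == "__main__":
    summary = summarize_dump(SAMPLE_DUMP)
    print("mismatch flagged:", summary["mismatch"])
    print("XOR checksum difference: 0x%08x" % summary["xor_difference"])
```

A dump with no mismatch flagged returns False for "mismatch", which is the cue to recheck the page number arithmetic as described above.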
Required Error Information
For each -1018 occurrence, you should always log the following:
-
Application log -1018 event information:
-
Date and time
-
Server name
-
Event ID
-
Event description
-
If a cluster, cluster node where the error occurred
-
Server make and model
-
Storage type:
-
Direct access storage device (DASD)
-
Fibre Channel Storage Area Network (SAN)
-
Internet small computer system interface (iSCSI) SAN
-
Network-attached storage
-
Storage make and model:
-
Disk controller
-
Multiple path configuration
-
Permanent location or share for event, log, and dump files
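If you prefer a simple text file to a spreadsheet, the required fields above map naturally onto one row of a comma-separated file. The Python sketch below is one possible layout, not the format used by Microsoft IT. The class, the field names, and every value in the example record are hypothetical.

```python
import csv
import sys
from dataclasses import dataclass, asdict, fields


@dataclass
class Error1018Record:
    """One row per -1018 occurrence; field names are illustrative."""
    date_time: str
    server_name: str
    event_id: int
    event_description: str
    cluster_node: str             # empty if the server is not clustered
    server_make_model: str
    storage_type: str             # DASD, Fibre Channel SAN, iSCSI SAN, or NAS
    storage_make_model: str
    disk_controller: str
    multipath_configuration: str
    file_location: str            # share that holds event, log, and dump files


def append_record(csv_file, record, write_header=False):
    """Write one record to an open text file in CSV format."""
    names = [f.name for f in fields(Error1018Record)]
    writer = csv.DictWriter(csv_file, fieldnames=names)
    if write_header:
        writer.writeheader()
    writer.writerow(asdict(record))


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    example = Error1018Record(
        date_time="2005-08-01 03:15",
        server_name="EXAMPLE-MBX-01",
        event_id=474,
        event_description="Read verification failed with error -1018",
        cluster_node="",
        server_make_model="Example vendor, model X",
        storage_type="Fibre Channel SAN",
        storage_make_model="Example array, model Y",
        disk_controller="Example HBA",
        multipath_configuration="Dual path",
        file_location=r"\\example-share\1018\2005-08-01",
    )
    append_record(sys.stdout, example, write_header=True)
```

Keeping one row per occurrence makes it straightforward to sort or filter later by server, storage model, or controller when looking for patterns.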
Additional Information
For each -1018 occurrence, you can also note the following:
-
Bookend period anomalies:
-
Restart
-
Cluster transition
-
Disk error
-
Memory error
-
Other
-
File offset
-
Logical page number (calculated from byte offset)
-
Actual checksum (calculated at run time)
-
Expected checksum (read from page)
-
Binary actual checksum
-
Binary expected checksum
-
Checksum XOR result
-
How discovered (run time, mount failure, or backup failure)
-
Server unavailable or available
-
Last good backup time
-
Error confirmed by (for example, Eseutil /M /P or Eseutil /K)
-
Permanent or transient error
-
Location of files (Eseutil and Esefile page dumps, raw page dumps, MPSReports)
-
Server hardware
-
Server BIOS
-
Controller
-
Controller firmware revision
-
Storage
-
Impact (databases affected)
-
Recovery downtime
-
Recovery strategy
-
Root cause
-
Comments
-
Entry by
XOR Calculation Sample for Excel
Appendix A described how to compare checksums to look for patterns. The Microsoft
Office Excel formulas below can be used to automate this comparison. You must install
the Analysis Toolpak for Excel for the necessary functions to be available. The
Toolpak can be installed from the Tools, Add-Ins menu in Excel.
Converting a Hexadecimal Checksum to Binary
Copy this formula into an Excel cell. This formula assumes that the hexadecimal
checksum is in cell A1. If the hexadecimal checksum is in a different cell, change
each reference to A1 in the formula to represent the actual cell. Ignore line breaks
in the formula—it is intended to be a single line in Excel:
=CONCATENATE(HEX2BIN(MID(A1,7,2),8)," ",HEX2BIN(MID(A1,5,2),8)," ",
HEX2BIN(MID(A1,3,2),8)," ",HEX2BIN(MID(A1,1,2),8))
This formula also reverses the order of the bytes in the checksum to conform to the
Intel little-endian storage format.
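The conversion above, and the bit-by-bit comparison it prepares for, can also be sketched outside Excel. The Python functions below are intended to mirror the behavior of the Excel formulas in this appendix for 32-bit (eight hexadecimal digit) checksums. The function names are illustrative, and accepting an optional 0x prefix is a convenience that the Excel formula does not offer.

```python
def hex_to_binary(checksum_hex):
    """Convert an 8-digit hex checksum to four 8-bit groups in reversed byte order."""
    digits = checksum_hex.strip()
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    if len(digits) != 8:
        raise ValueError("expected an 8-digit (32-bit) hexadecimal checksum")
    byte_pairs = [digits[i:i + 2] for i in range(0, 8, 2)]
    # Last byte first, matching MID(A1,7,2), MID(A1,5,2), and so on.
    return " ".join(format(int(pair, 16), "08b") for pair in reversed(byte_pairs))


def xor_binary(first, second):
    """XOR two binary strings produced by hex_to_binary, keeping the spaces."""
    if len(first) != len(second):
        raise ValueError("binary strings must have the same length")
    return "".join(
        " " if a == " " else str(int(a != b))
        for a, b in zip(first, second)
    )


if __name__ == "__main__":
    # XOR checksum values taken from the sample page dump earlier in this appendix.
    expected = hex_to_binary("91d88902")
    actual = hex_to_binary("91d889fd")
    print("expected:", expected)
    print("actual:  ", actual)
    print("XOR:     ", xor_binary(expected, actual))
```

In the XOR result, each 1 marks a bit that differs between the two checksums, so counting and locating the 1s across many errors is one way to look for the patterns described in Appendix A.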
Using XOR with Two Binary Checksums
This formula assumes that the binary checksums are in cells B1 and B2. If the checksums
are in other cells, replace each occurrence of B1 or B2 as appropriate. Ignore line
breaks in the formula—it is intended to be a single line in Excel:
=CONCATENATE(
(HEX2BIN(BIN2HEX(VALUE(SUBSTITUTE(MID(B1,1,8)+MID(B2,1,8),2,0)),8),8))," ",
(HEX2BIN(BIN2HEX(VALUE(SUBSTITUTE(MID(B1,10,8)+MID(B2,10,8),2,0)),8),8))," ",
(HEX2BIN(BIN2HEX(VALUE(SUBSTITUTE(MID(B1,19,8)+MID(B2,19,8),2,0)),8),8))," ",
(HEX2BIN(BIN2HEX(VALUE(SUBSTITUTE(MID(B1,28,8)+MID(B2,28,8),2,0)),8),8)))