crawlerconsistency.exe reference


Applies to: FAST Search Server 2010 for SharePoint

Use the crawlerconsistency tool to verify and repair the consistency of the crawler item and metadata structures on disk. You can use the tool to maintain internal crawler store consistency, or to recover a damaged crawl store.

By default, the tool detects and attempts to repair the following inconsistencies:

  • Items referenced in metadatabases, but not found in the item store.

  • Invalid items in the item store.

  • Unreferenced items in the item store (requires the docrebuild mode).

  • Checksums in the duplicate database that are not found in the metadatabases.

  • Multiple checksums assigned to the same URI in the duplicate database.

These inconsistencies are corrected automatically when you run the doccheck or docrebuild mode followed by the metacheck mode. Any inconsistent URIs are logged, and a delete operation is issued to the indexer to keep the index synchronized (you can disable the delete operation).

In a multi-node crawler environment, you can also use the tool to rebuild a duplicate server from the contents of per-node scheduler post-process checksum databases by using the ppduprebuild mode. Since this mode builds the duplicate server from scratch, you can also use it to change the number of duplicate servers that are used, by first changing the configuration and then rebuilding.
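For example, assuming the duprebuild mode name as listed in the options table below and an output folder chosen only for illustration, a rebuild of the duplicate server(s) could be started as follows:

<FASTSearchFolder>\bin\crawlerconsistency -M duprebuild -O <FASTSearchFolder>\var\log\crawler\consistency\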

Note

To use a command-line tool, verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.

Syntax

<FASTSearchFolder>\bin\crawlerconsistency [options]

Parameters

Parameter Description

<FASTSearchFolder>

The path of the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch.

crawlerconsistency options

Option Value Required Description

-M

<mode>[,<mode>,...,<mode>]

Yes

Specifies one or more modes to run the tool in:

  • doccheck - Verifies that all items referenced in the metadatabases also exist on disk.

  • docrebuild - Same as doccheck, but rewrites all referenced items to a fresh item store, removing orphans in the item store.

    Note

    This can take a long time.

  • metacheck - Verifies that all checksums referenced in the postprocess databases also exist in the metadatabases.

  • metarebuild - Attempts to recover a damaged metastore. Supports rebuilding a bad or lost site database based on segment databases.

  • duprebuild - Rebuilds the contents of the duplicate server(s) from the local post process database. Run this mode alone (without other modes).

Additionally, you can add the following modifiers:

  • updatestat - Updates the statistics item store counter.

    Note

    Use with the doccheck and docrebuild modes only.

  • routecheck - Verifies that sites/URIs are routed to the correct node scheduler. Applies to multi-node crawlers only.

Include additional modifiers in a comma-separated <mode> list. For example:

-M doccheck,docrebuild,updatestat


-O

<path>

Yes

The folder for all output logs.

The tool creates a subfolder named with the current date: <year><month><day>.

If the subfolder already exists, a number is appended to the name (for example, ".1").

-d

<path>

No

The folder that contains the crawl data, run-time configuration, and logs (as subfolders of the specified folder).

Default: data

-U

No

Indicates that you are running the tool on the multi-node scheduler, and that the data is in a different folder: data\crawler\config\multinode instead of data\crawler\config\node.

Applies to routecheck mode.
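For example, one possible routing check on the multi-node scheduler (the output folder is only an illustration; combine routecheck with other modes as needed):

<FASTSearchFolder>\bin\crawlerconsistency -M routecheck -U -O <FASTSearchFolder>\var\log\crawler\consistency\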

-C

<crawl_collection>[,<crawl_collection>,...,<crawl_collection>]

No

A comma-separated list of collections to check.

Default: all collections

-c

<cluster>[,<cluster>,...,<cluster>]

No

A comma-separated list of clusters to check.

Applies to doccheck and docrebuild modes.

Default: all clusters

-S

<crawl_site>[,<crawl_site>,...,<crawl_site>]

No

Processes only the specified site(s).

Applies to doccheck mode.

Default: All sites

-z

No

Compresses items in the item store when you run the docrebuild mode. If specified, this option overrides the collection-level compression setting.

Default: off
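For example, to rewrite the item store with compression enabled (the collection name is only an illustration):

<FASTSearchFolder>\bin\crawlerconsistency -M docrebuild -z -O <FASTSearchFolder>\var\log\crawler\consistency\ -C MyCollection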

-i

No

Skips the free disk space check. Normally, the tool checks the free disk space periodically; if it drops below 1 GB, the tool stops the operation and exits.

Warning

Use this option with caution.

-n

No

Indicates that delete operations are not submitted to the indexer; they are only logged to files.

To ensure that deleted items are not left in the index, manually delete those items, or refeed the collection into an empty index.
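For example, to log the inconsistencies without issuing any deletes to the indexer (the collection name and output folder are only illustrations):

<FASTSearchFolder>\bin\crawlerconsistency -M doccheck,metacheck -n -O <FASTSearchFolder>\var\log\crawler\consistency\ -C MyCollection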

-F

<file>

No

Loads the crawler global configuration from <file>. Any conflicting options on the command line override values in the file.

-T

No

Runs the tool in test mode. The tool does not delete anything from disk or issue any deletes to the indexer.

-h

No

Displays help.

-v

No

Displays version information.

-l

<log-level>

No

Specifies the kind of information to log:

  • debug

  • verbose

  • info

  • warning

  • error

Examples

The following example verifies and repairs item store and metastore consistency, and updates statistics counters.

<FASTSearchFolder>\bin\crawlerconsistency -M doccheck,metacheck,updatestat -O <FASTSearchFolder>\var\log\crawler\consistency\ -C MyCollection

This command verifies each metadatabase entry, verifies the corresponding item content in the crawler store, and logs any inconsistencies to log files in the specified output folder.
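The following example is a similar run in test mode; the collection name and output folder are only illustrations. Because of the -T option, the tool reports inconsistencies but does not delete anything from disk or issue any deletes to the indexer.

<FASTSearchFolder>\bin\crawlerconsistency -M docrebuild,updatestat -T -O <FASTSearchFolder>\var\log\crawler\consistency\ -C MyCollection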

Remarks

The tool generates the following log files. A log file is only created when the first URI is written to it.

Log file name Description

<mode>_ok.txt

Lists the URIs that were found and not removed as inconsistencies.

The output from the metacheck mode lists every URI with a unique checksum, which is useful for comparing against the index.

Note

Items may have been dropped by the pipeline, leaving URIs in this file that are not in the index. You can safely remove URIs in the index that are not in this file.

<mode>_deleted.txt

Lists URIs deleted by the tool. Unless you disabled indexer deletes with the -n option, the URIs were removed from the index. Since these URIs were deleted as crawler inconsistencies, they may still exist on the Web servers and should be indexed. Recrawl this list of URIs with the crawleradmin tool by using the --addurifile option (also use the --force option to speed up crawling).
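For example, assuming the doccheck mode wrote its output to the folder you specified with the -O option, a recrawl could be issued as follows (the path is only an illustration):

<FASTSearchFolder>\bin\crawleradmin --addurifile <output_folder>\doccheck_deleted.txt --force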

<mode>_deleted_reasons.txt

This log file is identical to the <mode>_deleted.txt file, but it also includes an error code that identifies the reason each URI was deleted. The error codes are defined as follows:

  • 101 - Item not found in item store

  • 102 - Item found, but unreadable, in item store

  • 103 - Item found, but length does not match meta information

  • 201 - Metadata for item not found

  • 202 - Metadata found, but unreadable

  • 203 - Metadata found, but does not match checksum in duplicate database

  • 204 - Metadata found, but has no checksum

  • 206 - URI's host name not found in routing database

<mode>_wrongnode.txt

Used for multi-node crawls, this file lists all URIs that were removed from a node because of incorrect routing. These URIs should be crawled by a different master node. The URIs are logged, but not deleted from the index.

<mode>_refeed.txt

Lists URIs whose URI equivalence class was updated by running the tool. To bring the index into sync, use postprocess refeed with the -i option to refeed the contents of this file, or perform a full refeed.
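For example, assuming postprocess accepts the refeed file through the -i option as described above (the path is only an illustration):

<FASTSearchFolder>\bin\postprocess -i <output_folder>\metacheck_refeed.txt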

Note

Always redirect the stdout and stderr output to a log file on disk.
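For example, using standard command-line redirection (the output folder is only an illustration):

<FASTSearchFolder>\bin\crawlerconsistency -M doccheck,metacheck -O <FASTSearchFolder>\var\log\crawler\consistency\ > crawlerconsistency_stdout.log 2> crawlerconsistency_stderr.log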

See Also

Reference

crawleradmin.exe reference

Concepts

Web Crawler XML configuration reference