Best practices for using crawl logs (Search Server 2010)

 

Applies to: Search Server 2010

Topic Last Modified: 2010-12-20

The crawl log tracks information about the status of crawled content. This log lets you determine whether crawled content was successfully added to the index, whether it was excluded because of a crawl rule, or whether indexing failed because of an error. The crawl log also contains more information about crawled content, including the time of the last successful crawl, the content sources, and whether any crawl rules were applied. You can use the crawl log to diagnose problems with the search experience.

In this article:

  • To view the crawl log

  • Crawl log views

  • Crawl log timer job

  • Troubleshoot common problems

To view the crawl log

  1. Verify that the user account that is performing this procedure is an administrator for the Search service application.

  2. In Central Administration, in the Quick Launch, click Application Management.

  3. On the Application Management page, under Service Applications, click Manage service applications.

  4. On the Service Applications page, in the list of service applications, click the Search service application that you want.

  5. On the Search Administration page, in the Quick Launch, under Crawling, click Crawl Log.

  6. On the Crawl Log – Content Source page, click the view that you want. (For a scripted alternative, see the sketch after these steps.)
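
The same navigation can be scripted. The following Windows PowerShell sketch, run from the SharePoint 2010 Management Shell on a farm server, retrieves the Search service application and lists its content sources. The service application name "Search Service Application" is a placeholder; replace it with the name used in your farm.

  # Load the SharePoint snap-in if the shell was not started as the Management Shell.
  Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

  # "Search Service Application" is a placeholder name; substitute your own.
  $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

  # List the content sources whose crawl results appear in the crawl log.
  Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
      Format-Table Name, Type, CrawlState, CrawlCompleted -AutoSize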

Crawl log views

You can select the following views to check the status of crawled content.

Content Source

Summarizes the items crawled per content source. Shows successes, warnings, errors, top-level errors, and deletes. The data in this view represents the current status, per content source, of items that are already present in the index. The object model provides the data for this view.

Host Name

Summarizes the items crawled per host. Shows successes, warnings, errors, deletes, top-level errors, and a total. The data in this view represents the current status, per host, of items that are already present in the index. If your environment has multiple crawl databases, the data is shown per crawl database. The Search Administration database provides the data for this view. You can filter the results by typing a URL in the Find URLs that begin with the following hostname/path box.

URL

Lets you search the crawl logs by content source, URL, or host name and view details of all items that are present in the index. The MSSCrawlURLReport table in the crawl database provides the data for this view. You can filter the results by setting the Status, Message, Start Time, and End Time fields.
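
If you prefer to query these entries from a script, the search administration object model exposes the crawl log programmatically. The following sketch assumes the Microsoft.Office.Server.Search.Administration.CrawlLog class and its GetCrawledUrls method, that $ssa holds the Search service application retrieved as shown earlier, and that "http://contoso" is a hypothetical URL prefix; verify the method signature against your installed version before relying on it.

  # Query crawl log entries for URLs that begin with a given string.
  $crawlLog = New-Object -TypeName Microsoft.Office.Server.Search.Administration.CrawlLog -ArgumentList $ssa

  # Parameters: getCountOnly, maxRows, urlQueryString, isLike, contentSourceID,
  # errorLevel, errorID, startDateTime, endDateTime (-1 and the MinValue/MaxValue
  # dates are used here to mean "no filter").
  $crawlLog.GetCrawledUrls($false, 100, "http://contoso", $true, -1, -1, -1,
      [System.DateTime]::MinValue, [System.DateTime]::MaxValue)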

Crawl History

Summarizes crawl transactions that were completed during a crawl. There can be multiple crawl transactions per item in a single crawl, so the number of transactions can be larger than the total number of items. This view shows data for three kinds of crawls:

  • Full. Crawls all items in a content source.

  • Incremental. Crawls items that have been changed since the last full or incremental crawl. This kind of crawl only runs if it is scheduled.

  • Delete. If start addresses are removed from a content source, a delete crawl removes items associated with the deleted start address from the index before a full or incremental crawl runs. This kind of crawl cannot be scheduled.

The Search Administration database provides the data for this view. You can filter the results by content source.
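
To trigger the crawls described above without the UI, you can start them from PowerShell. This is a minimal sketch that assumes $ssa holds the Search service application retrieved earlier and that a content source named "Local SharePoint sites" exists; substitute the name of your own content source.

  # Get a single content source and start a crawl when it is idle.
  $cs = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Local SharePoint sites"

  if ($cs.CrawlState -eq "Idle")
  {
      # Use StartFullCrawl() to crawl every item, or StartIncrementalCrawl()
      # to crawl only items changed since the last crawl.
      $cs.StartIncrementalCrawl()
  }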

Error Message

Provides aggregates of errors per content source or host name. The MSSCrawlURLReport table in the crawl database provides the data for this view. You can filter by content source or host.

Note

The filter drop-down box only shows content sources that contain errors. If there is an error against an item that does not appear in the index, the error does not appear in this view.

The Content Source, Host Name, and Crawl History views show data in the following columns:

  • Successes. Items that were successfully crawled and searchable.

  • Warnings. Items that might not have been successfully crawled and might not be searchable.

  • Errors. Items that were not successfully crawled and might not be searchable.

  • Deletes. Items that were removed from the index and are no longer searchable.

  • Top Level Errors. Errors in top-level documents, including start addresses, virtual servers, and content databases. Every top-level error is counted as an error, but not all errors are counted as top-level errors. Because the Errors column includes the count from the Top Level Errors column, top-level errors are not counted again in the Host Name view.

  • Not Modified. Items that were not modified between crawls.

  • Security Update. Items whose security settings were crawled because they were modified.

Crawl log timer job

By default, the data for each view in the crawl log is refreshed every five minutes by the timer job Crawl Log Report for Search Application <Search Service Application name>. You can change the refresh rate for this timer job, but in general, this setting should remain as is.

Tip

If you think the crawl log is not showing fresh data, make sure that the timer job has not been paused and has recently run.

To check the status of the crawl log timer job

  1. Verify that the user account that is performing this procedure is a member of the Farm Administrators SharePoint group.

  2. In Central Administration, in the Monitoring section, click Check job status.

  3. On the Timer Job Status page, click Job History.

  4. On the Job History page, find Crawl Log Report for Search Application <Search Service Application name> for the Search service application that you want and review the status. (A PowerShell check is sketched after these steps.)
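
The same check can be scripted. This sketch assumes the timer job's display name begins with "Crawl Log Report", as described above; adjust the filter if the job is named differently in your farm.

  # Find the crawl log report timer job(s) and show when each last ran.
  Get-SPTimerJob | Where-Object { $_.DisplayName -like "Crawl Log Report*" } |
      Format-Table DisplayName, LastRunTime, Schedule -AutoSize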

To change the refresh rate for the crawl log timer job

  1. Verify that the user account that is performing this procedure is a member of the Farm Administrators SharePoint group.

  2. In Central Administration, in the Monitoring section, click Check job status.

  3. On the Timer Job Status page, click Job History.

  4. On the Job History page, click Crawl Log Report for Search Application <Search Service Application name> for the Search service application that you want.

  5. On the Edit Timer Job page, in the Recurring Schedule section, change the timer job schedule to the interval that you want.

  6. Click OK. (A PowerShell alternative is sketched after these steps.)
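
You can also change the schedule from PowerShell. This sketch assumes a single matching timer job and uses "every 10 minutes between 0 and 59" as an example schedule string; as noted earlier, the default five-minute interval is usually appropriate.

  # Retrieve the crawl log report timer job and set a new recurring schedule.
  $job = Get-SPTimerJob | Where-Object { $_.DisplayName -like "Crawl Log Report*" }
  Set-SPTimerJob -Identity $job -Schedule "every 10 minutes between 0 and 59"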

Troubleshoot common problems

This section provides information about common crawl log errors, crawler behavior, and actions to take to maintain a healthy crawling environment.

When an item is deleted from the index

When the crawler cannot find an item that exists in the index, either because the URL is obsolete or because a network outage prevents access, it reports an error for that item in that crawl. If the error persists for the next three crawls, the item is deleted from the index. For file-share content sources, items are immediately deleted from the index when they are deleted from the file share.

“Object could not be found” error for a file share

This error can result from a crawled file-share content source that contains a valid host name but an invalid file name. For example, with a host name and file name of \\ValidHost\files\file1, \\ValidHost exists, but the file file1 does not. In this case, the crawler reports the error "Object could not be found" and deletes the item from the index. The Crawl History view shows:

  • Errors: 1

  • Deletes: 1

  • Top Level Errors: 1 (\\ValidHost\files\file1 shows as a top-level error because it is a start address)

The Content Source view shows:

  • Errors: 0

  • Deletes: 0

  • Top Level Errors: 0

The Content Source view will show all zeros because it only shows the status of items that are in the index, and this start address was not entered into the index. However, the Crawl History view shows all crawl transactions, whether or not they are entered into the index.

“Network path for item could not be resolved” error for a file share

This error can result from a crawled file-share content source that contains an invalid host name and an invalid file name. For example, with a host name and file name of \\InvalidHost\files\file1, neither \\InvalidHost nor the file file1 exists. In this case, the crawler reports the error "Network path for item could not be resolved" and does not delete the item from the index. The Crawl History view shows:

  • Errors: 1

  • Deletes: 0

  • Top Level Errors: 1 (\\InvalidHost\files\file1 shows as a top-level error because it is a start address)

The Content Source view shows:

  • Errors: 0

  • Deletes: 0

  • Top Level Errors: 0

The item is not deleted from the index because the crawler cannot determine whether the item no longer exists or whether a network outage is preventing access to it.

Obsolete start addresses

The crawl log reports top-level errors for top-level documents, or start addresses. To ensure healthy content sources, you should take the following actions:

  • Always investigate non-zero top-level errors.

  • Always investigate top-level errors that appear consistently in the crawl log.

  • In addition, we recommend that you remove obsolete start addresses every two weeks, after contacting the owner of the site.

To troubleshoot and delete obsolete start addresses

  1. Verify that the user account that is performing this procedure is an administrator for the Search service application.

  2. When you have determined that a start address might be obsolete, first determine whether it exists by pinging the site (see the reachability sketch after these steps). If you receive a response, determine which of the following issues caused the problem:

    • If you can access the URL from a browser, the crawler most likely could not crawl the start address because of a problem with the network connection.

    • If the browser is redirected to a different URL, change the start address to match the new address.

    • If the URL returns an error in the browser, try again later. If it still returns an error after multiple tries, contact the site owner to confirm that the site is available.

  3. If you do not receive a response when you ping the site, the site probably no longer exists and you should delete its start address. Confirm this with the site owner before you delete the start address.
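
The following PowerShell sketch shows the kind of reachability checks described in the previous steps. The host name "fileserver01" and the URL "http://contoso/sites/team" are hypothetical examples; replace them with the start address that you are investigating.

  # Ping the host behind a start address; $true indicates a reply was received.
  Test-Connection -ComputerName "fileserver01" -Count 2 -Quiet

  # For an HTTP start address, try to download the page. An exception here
  # suggests that the site is unreachable or returning an error.
  (New-Object System.Net.WebClient).DownloadString("http://contoso/sites/team") | Out-Null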

Access Denied

When the crawl log continually reports an "Access Denied" error for a start address, the content access account might not have Read permissions to crawl the site. If you are able to view the URL with an administrative account, there might be a problem with how the permissions were updated. In this case, you should contact the site owner to request permissions. For information about how to set permissions for a crawler, see Manage crawl rules (Search Server 2010).

Numbers set to zero in Content Source view during host distribution

During a host distribution, the numbers in all columns in the Content Source view are set to zero. This happens because the numbers in the Content Source view come directly from the crawl database tables. During a host distribution, the data in these tables is being moved, so the values remain at zero for the duration of the host distribution.

After the host distribution is complete, run an incremental crawl of the content sources in order to restore the original numbers.
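
As a convenience, the incremental crawl can be started for all content sources in one pass. This sketch assumes $ssa holds the Search service application retrieved earlier.

  # Start an incremental crawl of every idle content source.
  Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
      Where-Object { $_.CrawlState -eq "Idle" } |
      ForEach-Object { $_.StartIncrementalCrawl() }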

Showing file-share deletes in Content Source view

When documents are deleted from a file-share content source that was successfully crawled, they are immediately deleted from the index during the next full or incremental crawl. These items will show as errors in the Content Source view of the crawl log, but will show as deletes in other views.

Stopping or restarting the SharePoint Server Search service causes crawl log transaction discrepancy

The SharePoint Server Search service (OSearch14) might be reset or restarted due to administrative operations or server functions. When this occurs, a discrepancy in the crawl history view of the crawl log can occur. You may notice a difference between the number of transactions reported per crawl and the actual number of transactions performed per crawl. This can occur because the OSearch14 service stores active transactions in memory and writes these transactions after they are completed. If the OSearch14 service is stopped, reset, or restarted before the in-memory transactions have been written to the crawl log database, the number of transactions per crawl will be shown incorrectly.