
Crawl content (Search Server 2008)


Updated: September 23, 2010

Applies To: Microsoft Search Server 2008


Note:

Unless otherwise noted, the information in this article applies to both Microsoft Search Server 2008 and Microsoft Search Server 2008 Express.

Crawling content is the process by which a system accesses and parses content and its properties, sometimes called metadata, to build a content index from which search queries can be served.

The result of successfully crawling content is that the individual files or pieces of content that you want to make available to search queries are accessed and read by the crawler. The keywords and metadata for those files are stored in the content index, sometimes called the index. The index consists of the keywords that are stored in the file system of the index server and the metadata that is stored in the search database. The system maintains a mapping between the keywords, the metadata associated with the individual pieces of content, and the URL of the source from which the content was crawled.
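The mapping described above — keywords, per-item metadata, and the source URL of each crawled item — can be sketched as a minimal inverted index. This is an illustrative model only, not the actual Search Server 2008 storage format; the function and structure names are hypothetical.

```python
# Illustrative sketch of a content index: keywords map to the URLs of
# crawled items, and metadata is stored per URL. Hypothetical structures,
# not the actual Search Server 2008 on-disk format.

def build_index(documents):
    """documents: {url: {"text": str, "metadata": dict}}"""
    keywords = {}   # keyword -> set of URLs (kept in the index server's file system)
    metadata = {}   # url -> metadata (kept in the search database)
    for url, doc in documents.items():
        metadata[url] = doc["metadata"]
        for word in doc["text"].lower().split():
            keywords.setdefault(word, set()).add(url)
    return keywords, metadata

def query(keywords, term):
    """Return the URLs whose crawled text contained the term."""
    return keywords.get(term.lower(), set())
```

Note that the crawler only reads the source content; the index and metadata stores are written on the index server and in the search database, never back to the host server.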

Note:

The crawler does not change the files on the host servers. Instead, the files on the host servers are accessed and read, and the text and metadata for those files are sent to the index server to be indexed. However, because the crawler reads the content on the host server, some servers that host certain sources of content might update the last accessed date on files that have been crawled.

Determining when to crawl content

After a server farm has been deployed and has been running for some time, a search services administrator typically must adjust the crawl schedule. This might be necessary for the following reasons:

  • To accommodate changes in downtimes and periods of peak usage.

  • To accommodate changes in the frequency at which content is updated on the servers hosting the content.

  • To schedule crawls so that:

    • Content that is hosted on slower host servers is crawled separately from content that is hosted on faster host servers.

    • New content sources are crawled.

    • Crawls occur as often as targeted content is updated. For example, you might want to perform daily crawls on repositories that are updated each day and crawl repositories that are updated infrequently less often.

Performing crawls

Typically, you want to automate most crawls by scheduling them. However, sometimes you might want to start a crawl manually. For example, you might start a crawl to apply administrative changes, such as new crawl rules, to the content you crawl and index, or to determine whether an error in the crawl log has been resolved.

In addition, whether a crawl is started manually or by a schedule, you might need to stop or pause one or more crawls. For example, an administrator whose server hosts the content you are crawling might notify you that crawling is placing too much load on the server or you might be alerted that the server you are crawling is currently offline. In either of these cases, you might want to stop or pause the crawl.

You should consider that more time and server resources are required to perform a full crawl than are required to perform an incremental crawl. Full crawls:

  • Consume more memory and CPU cycles on the index server than incremental crawls.

  • Consume more memory and CPU cycles on the Web front-end servers when crawling content in your server farm. This does not apply to content that is external to your server farm.

  • Use more network bandwidth than incremental crawls.

Important:

When you stop a crawl of a content source, Microsoft Search Server 2008 automatically performs a full crawl of that content source the next time it is crawled, even if you request an incremental crawl. Therefore, carefully consider whether to pause the crawl instead of stopping it.

You must also be careful not to pause crawls for too many content sources at the same time, because each content source that is paused consumes memory and CPU resources on the index server.

You can start a full or incremental crawl, and stop, pause, or resume a running crawl, individually for each content source.
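The crawl-control behavior described above — including the rule that a stopped crawl forces the next crawl to be full — can be modeled as a small state machine. This is a hypothetical sketch for illustration, not the Search Server 2008 API; the class and method names are invented.

```python
# Hypothetical per-content-source crawl control. Illustrates the rule that
# stopping a crawl forces the next crawl to be full, even if an
# incremental crawl is requested. Not the actual Search Server 2008 API.

class ContentSourceCrawl:
    def __init__(self, name):
        self.name = name
        self.state = "idle"        # idle | running | paused
        self.force_full = True     # the first crawl of a source must be full

    def start(self, kind="incremental"):
        if self.state != "idle":
            raise RuntimeError("a crawl is already in progress")
        if self.force_full:
            kind = "full"          # stopped or never-crawled sources get a full crawl
        self.state = "running"
        return kind

    def pause(self):
        if self.state == "running":
            self.state = "paused"  # keeps crawl state (and index-server resources)

    def resume(self):
        if self.state == "paused":
            self.state = "running"

    def stop(self):
        self.state = "idle"
        self.force_full = True     # next crawl will be full regardless of request

    def complete(self):
        self.state = "idle"
        self.force_full = False    # incremental crawls are now possible
```

The `pause` path keeps state in memory, which is why pausing too many content sources at the same time consumes index server resources, while `stop` discards that state at the cost of the next crawl being full.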

Scheduling crawls

The following sections provide more information about considerations for crawling content on a schedule.

Downtimes and periods of peak usage

Consider the downtimes and peak usage times of the servers that host the content you want to crawl. For example, if you are crawling content hosted by many different servers outside your server farm, these servers are likely backed up on different schedules and have different peak usage times. The administration of servers outside your server farm is typically out of your control. Therefore, we recommend that you coordinate your crawls with the administrators of the servers that host the content you want to crawl, to ensure that you do not attempt to crawl content on their servers during a downtime or peak usage period.

Note:

Because the times of peak usage and downtimes for host servers can change, we recommend that you periodically reevaluate your crawl schedules for all content sources, not just the new ones you create.

A common scenario involves content outside the control of your organization that relates to the content on your SharePoint sites. You can add the start addresses for this content to an existing content source or create a new content source for external content. Because availability of external sites varies widely, it is helpful to add separate content sources for different external content. In this way, the content sources for external content can be crawled at different times than your other content sources. You can update external content on a crawl schedule that accounts for the availability of each site.

Content that is updated frequently

When planning crawl schedules, consider that some sources of content are updated more frequently than others. For example, if you know that content on some site collections or external sources is updated only on Fridays, it wastes resources to crawl that content more frequently than once each week. However, your server farm might contain other site collections that are continually updated Monday through Friday, but not typically updated on Saturdays and Sundays. In this case, you might want to crawl those sites several times during the week and not at all on weekends.

The way in which content is stored across the site collections in your environment can guide you to create additional content sources for each of your site collections in each of your Web applications. For example, if a site collection stores only archived information, you might not need to crawl that content as frequently as you crawl a site collection that stores frequently updated content. In this case, you might want to crawl these two site collections using different content sources so that they can be crawled on different schedules.
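The grouping strategy above can be sketched as a simple mapping from update frequency to content sources, so that each source can receive its own schedule. The URLs, frequency labels, and function name below are hypothetical examples, not configuration values from Search Server 2008.

```python
# Hypothetical grouping of site collections into content sources by how
# often their content changes, so each source can get its own crawl
# schedule. URLs and frequency labels are invented examples.

UPDATE_PROFILE = {
    "http://archive": "monthly",    # archived, rarely updated content
    "http://teams":   "weekdays",   # updated Monday through Friday
    "http://news":    "daily",      # updated every day
}

def group_into_content_sources(profiles):
    """Return {frequency_label: [start addresses]} so that each content
    source contains sites with a similar update frequency."""
    sources = {}
    for url, freq in profiles.items():
        sources.setdefault(freq, []).append(url)
    return sources
```

Each resulting group would then become one content source, crawled no more often than its content actually changes.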

Full and incremental crawl schedules

As a search services administrator, you can independently configure crawl schedules for each content source. For each content source, you can specify a time to do full crawls and a different time to do incremental crawls.

Note:

You must run a full crawl for a particular content source before you can run an incremental crawl.

We recommend that you plan crawl schedules based on the availability, performance, and bandwidth considerations of the servers running the search service and the servers hosting the crawled content.

When you plan crawl schedules, consider the following best practices:

  • Group start addresses in content sources based on similar availability and with acceptable overall resource usage for the servers that host the content.

  • Schedule incremental crawls for each content source during times when the servers that host the content are available and when there is low demand on the resources of the server. You can also add or edit one or more crawler impact rules to reduce the load on the servers that are being crawled. For information about crawler impact rules, see Manage crawler impact (Search Server 2008).

  • Stagger crawl schedules so that the load on the servers in your farm is distributed over time.

  • Schedule full crawls only when necessary for the reasons listed in the next section. We recommend that you do full crawls less frequently than incremental crawls.

  • Schedule administrative changes that require a full crawl to occur shortly before the next planned full crawl. For example, we recommend that you create a new crawl rule just before the next scheduled full crawl so that an additional full crawl is not necessary.

  • Base simultaneous crawls on the capacity of the index server to handle them. We recommend that you stagger crawl schedules so that the index server does not crawl multiple content sources at the same time. The performance of the index server and of the servers that host the content determines the extent to which crawls can overlap. You can develop a crawl scheduling strategy over time as you become familiar with the typical crawl durations for each content source. We recommend that you record trend data on how long crawls take in your environment.
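The staggering recommendation above can be sketched as a small scheduling helper that offsets each content source's start time by a fixed gap. The source names, start time, and gap below are hypothetical; in practice the gap would come from the recorded crawl-duration trend data.

```python
from datetime import datetime, timedelta

# Sketch: stagger incremental crawl start times so that crawls of
# different content sources do not begin simultaneously. Source names
# and the gap are hypothetical examples.

def stagger_schedule(sources, first_start, gap_minutes):
    """Assign each content source a start time offset by gap_minutes
    from the previous one, in list order."""
    schedule = {}
    start = first_start
    for name in sources:
        schedule[name] = start
        start += timedelta(minutes=gap_minutes)
    return schedule
```

A 90-minute gap, for example, spreads three sources across the overnight window instead of starting them all at once.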

Reasons to perform a full crawl

Reasons for a search services administrator to do a full crawl include:

  • A hotfix or service pack was installed on servers in the farm. See the instructions for the hotfix or service pack for more information.

  • A search services administrator added a new managed property. A full crawl is required for the new managed property to take effect immediately. If you do not want the new managed property to take effect immediately, a full crawl is not required.

  • To re-index ASPX pages on Windows SharePoint Services 3.0 sites.

    Note:

    The crawler cannot discover when ASPX pages on Windows SharePoint Services 3.0 sites have changed. Because of this, incremental crawls do not re-index views or home pages when individual list items are deleted. We recommend that you periodically do full crawls of sites that contain ASPX files to ensure that these pages are re-indexed.

  • To detect security changes that were made on a file share after the last full crawl of the file share.

  • To resolve consecutive incremental crawl failures. In rare cases, if an incremental crawl fails one hundred consecutive times at any level in a repository, the index server removes the affected content from the index.

  • Crawl rules have been added, deleted, or modified.

  • To repair a corrupted index.

  • The search services administrator has created one or more server name mappings.

  • The account assigned to the default content access account or crawl rule has changed.

The system does a full crawl even when an incremental crawl is requested under the following circumstances:

  • A search services administrator stopped the previous crawl.

  • A content database was restored, or a farm administrator has detached and reattached a content database.

    Note:

    If you are running the Infrastructure Update for Microsoft Office Servers, you can use the restore operation of the Stsadm command-line tool to change whether a content database restore causes a full crawl.

  • A full crawl of the site has never been done.

  • The change log does not contain entries for the addresses that are being crawled. Without entries in the change log for the items being crawled, incremental crawls cannot occur.
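The circumstances above can be summarized as a decision function: given a content source's state, determine whether the system will force a full crawl even though an incremental crawl was requested. This is a sketch mirroring the listed rules, not actual Search Server 2008 logic; the field names are invented.

```python
# Hypothetical helper mirroring the rules above: the system performs a
# full crawl even when an incremental crawl is requested under these
# conditions. Field names are invented for illustration.

def must_full_crawl(source):
    """source: dict describing a content source's crawl state."""
    if source.get("previous_crawl_stopped"):
        return True    # an administrator stopped the previous crawl
    if source.get("content_db_restored"):
        return True    # content database restored or detached/reattached
    if not source.get("ever_fully_crawled"):
        return True    # a full crawl of the site has never been done
    if not source.get("change_log_has_entries"):
        return True    # incremental crawls require change log entries
    return False
```

Only when none of these conditions hold can a requested incremental crawl actually run incrementally.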

You can adjust schedules after the initial deployment based on the performance and capacity of servers in the farm and the servers hosting content.

