Determine when to crawl your content (Office SharePoint Server 2007 for Search)

After a server farm has been deployed and running for some time, a shared services administrator typically must change when some of the content is crawled. This is usually done for the following reasons:

  • To accommodate changes in downtimes and periods of peak usage.

  • To accommodate changes in the frequency at which content is updated on the servers hosting the content.

  • To schedule crawls so that:

    • Content that is hosted on slower host servers is crawled separately from content that is hosted on faster host servers.

    • New content sources are crawled.

The following sections provide more information about crawling content on different schedules.

Downtimes and periods of peak usage

Consider the downtimes and peak usage times of the servers that host the content you want to crawl. For example, if you are crawling content hosted by many different servers outside your server farm, it is likely that these servers are backed up on different schedules and have different peak usage times. The administration of servers outside your server farm is typically out of your control. Therefore, we recommend that you coordinate your crawls with the administrators of the servers that host the content you want to crawl, to ensure that you do not attempt to crawl content on their servers during a downtime or peak usage time.

Note

Because the times of peak usage and downtimes for host servers can change, we recommend that you periodically reevaluate your crawl schedules for all content sources, not just the new ones you create.

A common scenario involves content outside the control of your organization that relates to the content on your SharePoint sites. You can add the start addresses for this content to an existing content source or create a new content source for external content. Because availability of external sites varies widely, it is helpful to add separate content sources for different external content. In this way, the content sources for external content can be crawled at different times than your other content sources. You can then update external content on a crawl schedule that accounts for the availability of each site.
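As a rough illustration of checking a proposed crawl window against the downtimes and peak usage times of a host, the following Python sketch tests for overlap. The host names, the blackout windows, and the function names are hypothetical and are not part of Office SharePoint Server 2007. The sketch only handles windows that start and end on the same day.

```python
from datetime import time

def overlaps(start_a, end_a, start_b, end_b):
    """Return True if two same-day, half-open windows [start, end) overlap."""
    return start_a < end_b and start_b < end_a

# Hypothetical downtime and peak-usage windows, as reported by the
# administrators of each host server. These values are illustrative only.
BLACKOUTS = {
    "partner-site": [(time(8, 0), time(18, 0))],   # peak usage
    "archive-share": [(time(1, 0), time(3, 0))],   # nightly backup
}

def crawl_window_is_safe(host, start, end, blackouts=BLACKOUTS):
    """Return True if the crawl window avoids every blackout window of the host."""
    return not any(
        overlaps(start, end, b_start, b_end)
        for b_start, b_end in blackouts.get(host, [])
    )
```

For example, `crawl_window_is_safe("partner-site", time(19, 0), time(21, 0))` evaluates to `True`, while a window of 17:00 to 19:00 for the same host evaluates to `False` because it overlaps the peak usage period.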

Content that is updated frequently

When planning crawl schedules, consider that some sources of content are updated more frequently than others. For example, if you know that content on some site collections or external sources is updated only on Fridays, it wastes resources to crawl that content more frequently than once each week. However, your server farm might contain other site collections that are continually updated Monday through Friday, but not typically updated on Saturdays and Sundays. In this case, you might want to crawl several times each weekday, but only once or twice on weekends.
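The reasoning above can be sketched as a small planning helper. This is only an illustration: the function name, its parameters, and the run counts are assumptions chosen to mirror the example in the text, and the sketch does not call any SharePoint API.

```python
from datetime import date

def crawls_on(day, update_days, runs_on_update_day, runs_otherwise):
    """Return how many incremental crawls to plan for the given date.

    update_days holds weekday numbers as returned by date.weekday():
    0 is Monday and 6 is Sunday.
    """
    if day.weekday() in update_days:
        return runs_on_update_day
    return runs_otherwise

# Hypothetical source that is updated only on Fridays: crawl once a week.
FRIDAY_ONLY = dict(update_days={4}, runs_on_update_day=1, runs_otherwise=0)

# Hypothetical source that is updated Monday through Friday:
# several crawls each weekday, one on each weekend day.
WEEKDAYS = dict(update_days={0, 1, 2, 3, 4}, runs_on_update_day=4, runs_otherwise=1)
```

For example, `crawls_on(date(2024, 1, 6), **WEEKDAYS)` plans one crawl, because 6 January 2024 is a Saturday.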

The way in which content is stored across the site collections in your environment can guide you to create additional content sources for each of your site collections in each of your Web applications. For example, if a site collection stores only archived information, you might not need to crawl that content as frequently as you crawl a site collection that stores frequently updated content. In this case, you might want to crawl these two site collections using different content sources so that they can be crawled on different schedules.

Full and incremental crawl schedules

Shared services administrators can independently configure crawl schedules for each content source. For each content source, they can specify a time to do full crawls and a different time to do incremental crawls. Note that you must run a full crawl for a particular content source before you can run an incremental crawl.

We recommend that you plan crawl schedules based on the availability, performance, and bandwidth considerations of the servers running the search service and the servers hosting the crawled content.

When you plan crawl schedules, consider the following best practices:

  • Group start addresses in content sources based on similar availability and with acceptable overall resource usage for the servers that host the content.

  • Schedule incremental crawls for each content source during times when the servers that host the content are available and when there is low demand on the resources of the server.

  • Stagger crawl schedules so that the load on the servers in your farm is distributed over time.

  • Schedule full crawls only when necessary for the reasons listed in the next section. We recommend that you do full crawls less frequently than incremental crawls.

  • Schedule administration changes that require a full crawl to occur shortly before the planned schedule for full crawls. For example, we recommend that you attempt to schedule the creation of a crawl rule before the next scheduled full crawl so that an additional full crawl is not necessary.

  • Base simultaneous crawls on the capacity of the index server to crawl them. We recommend that you stagger your crawl schedules so that the index server does not crawl using multiple content sources at the same time. The performance of the index server and the performance of the servers hosting the content determine the extent to which crawls can be overlapped. You can develop a strategy for scheduling crawls over time as you become familiar with the typical crawl durations for each content source.
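One possible way to stagger crawls, once typical crawl durations are known, is to start each crawl after the expected end of the previous one. The following Python sketch shows that idea. The content source names, the durations, and the 15-minute gap are hypothetical values, and the function is a planning aid only, not a SharePoint feature.

```python
from datetime import datetime, timedelta

def stagger(first_start, durations, gap=timedelta(minutes=15)):
    """Return a {content_source: start_time} plan with no overlapping crawls.

    durations maps each content source to its typical crawl duration.
    Crawls are planned in the order in which the sources are listed.
    """
    schedule = {}
    next_start = first_start
    for name, duration in durations.items():
        schedule[name] = next_start
        # The next crawl starts after this one is expected to end, plus a gap.
        next_start = next_start + duration + gap
    return schedule

# Hypothetical typical durations observed for three content sources.
plan = stagger(
    datetime(2024, 1, 1, 22, 0),
    {
        "intranet": timedelta(minutes=90),
        "file-shares": timedelta(minutes=45),
        "external": timedelta(minutes=30),
    },
)
```

With these assumed values, the plan starts the intranet crawl at 22:00, the file share crawl at 23:45, and the external crawl at 00:45 on the following day.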

Reasons to perform a full crawl

Reasons for a shared services administrator to perform a full crawl include:

  • Software updates or service packs were installed on servers in the farm.

  • A shared services administrator added a new managed property.

  • To reindex ASPX pages on Windows SharePoint Services 3.0 or Microsoft Office SharePoint Server 2007 sites.

    Note

    The crawler cannot discover when ASPX pages on Windows SharePoint Services 3.0 or Microsoft Office SharePoint Server 2007 sites have changed. Because of this, incremental crawls do not reindex views or home pages when individual list items are deleted. We recommend that you periodically do full crawls of sites that contain ASPX files to ensure that these pages are reindexed.

  • To detect security changes that were made on file shares after the last full crawl of the file share.

  • To resolve consecutive incremental crawl failures. In rare cases, if an incremental crawl fails one hundred consecutive times at any level in a repository, the index server removes the affected content from the index.

  • Crawl rules have been added or modified.

  • To repair a corrupted index.

The system does a full crawl even when an incremental crawl is requested under the following circumstances:

  • A shared services administrator stopped the previous crawl.

  • A content database was restored.

    Note

    You can use the Stsadm restore operation to change whether a content database restore causes a full crawl.

  • A full crawl of the site has never been done.

  • Corruption is detected in the index. Depending on the severity of the corruption, the system might attempt to perform a full crawl.
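The circumstances listed above can be summarized as a simple decision function. The following Python sketch is a simplified model for planning purposes only. The class, field, and function names are hypothetical, and the sketch does not reflect how the search service is implemented. In particular, it treats a content database restore and detected index corruption as always forcing a full crawl, although the text above notes that both depend on configuration or severity.

```python
from dataclasses import dataclass

@dataclass
class SourceState:
    """Hypothetical summary of what has happened to a content source."""
    full_crawl_completed: bool = True      # a full crawl has been done at least once
    previous_crawl_stopped: bool = False   # an administrator stopped the last crawl
    content_db_restored: bool = False      # a content database was restored
    index_corruption_detected: bool = False

def effective_crawl_type(requested, state):
    """Return "full" or "incremental" for a requested crawl type."""
    if requested == "full":
        return "full"
    # Simplification: any of these conditions turns an incremental
    # request into a full crawl in this model.
    forces_full = (
        not state.full_crawl_completed
        or state.previous_crawl_stopped
        or state.content_db_restored
        or state.index_corruption_detected
    )
    return "full" if forces_full else "incremental"
```

For example, `effective_crawl_type("incremental", SourceState(content_db_restored=True))` returns `"full"` in this model, which can help you anticipate longer crawl durations when you plan schedules.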

You can adjust schedules after the initial deployment based on the performance and capacity of servers in the farm and the servers hosting content.

Task requirements

To schedule a crawl, you can perform one of the following procedures:

See also

Concepts

Crawl content (Office SharePoint Server 2007 for Search)