Determine crawl schedules
Published: May 12, 2010
The key variable that affects the load on the crawler, the network, and the remote Web servers is the request rate. This value is determined by the delay setting, which indicates how long the Web crawler should wait, after fetching a Web item, before requesting the next one.
For each Web site that is crawled concurrently, the Web item request rate is a function of the delay and max_pending settings, as well as the response time of the Web server itself. The Web crawler's overall download rate depends on how many Web sites are being crawled concurrently. The theoretical maximum rate is the value of the max_sites setting divided by the value of the delay setting. However, the actual rate will generally be the number of Web sites currently being crawled divided by the delay setting.
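The relationship between these settings can be illustrated with a small calculation. The following Python sketch is not part of the crawler; the function names and the example values (120 sites, 30 active sites) are hypothetical, and the calculation ignores max_pending and server response time, so it gives only an upper estimate.

```python
def theoretical_max_rate(max_sites: int, delay_seconds: float) -> float:
    """Upper bound on Web items requested per second across all sites."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return max_sites / delay_seconds


def expected_rate(active_sites: int, max_sites: int, delay_seconds: float) -> float:
    """Approximate rate when only some sites are being crawled concurrently."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    # No more than max_sites sites can be crawled at the same time.
    return min(active_sites, max_sites) / delay_seconds


# Illustrative values only; these are not documented defaults for max_sites.
print(theoretical_max_rate(max_sites=120, delay_seconds=60))              # 2.0 items/second
print(expected_rate(active_sites=30, max_sites=120, delay_seconds=60))    # 0.5 items/second
```

With a 60 second delay, each site receives at most one request per minute, so the overall rate grows only with the number of sites being crawled at the same time.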
Do not overload any Web sites or servers you plan to crawl.
Use the default delay of 60 seconds (or higher) when crawling Web sites that are not owned by the organization running the crawl. This delay avoids placing a heavy load on the Web server from which Web items are being requested. For Web crawls within the same organization or network, lower values may be used, but very low values (for example, less than 5 seconds) can put significant stress on the systems involved.
The refresh interval, configured by the refresh setting, determines the overall length of the crawl cycle. The crawl cycle is the period of time over which a crawl should run without revisiting a Web site to see if new or modified Web items exist. Picking an appropriate interval depends on the amount of content to be fetched (which in turn depends both on the number of Web sites and on how many Web items each contains) and the update rate or freshness of the Web sites. In some cases, there are many Web sites with very static/stable content, and a few that are updated frequently; these can be configured either as separate crawl collections, or given distinct settings through the use of a sub collection.
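As a rough aid for picking the interval, the shortest possible crawl cycle can be estimated from the amount of content and the delay. The Python sketch below is an assumption-laden estimate, not a crawler feature: the function name and the item counts are hypothetical, and it ignores server response time, max_pending, and any content that is skipped. The result is a lower bound, so the refresh interval should be comfortably larger.

```python
def min_cycle_seconds(items_per_site: list[int], delay_seconds: float, max_sites: int) -> float:
    """Lower bound on the time needed to fetch every Web item once."""
    if delay_seconds <= 0 or max_sites <= 0:
        raise ValueError("delay_seconds and max_sites must be positive")
    if not items_per_site:
        return 0.0
    # A single site yields at most one item per delay period,
    # so the largest site alone sets a floor on the cycle length.
    largest_site = max(items_per_site) * delay_seconds
    # At most max_sites sites are crawled in parallel, which limits total throughput.
    throughput_bound = sum(items_per_site) * delay_seconds / max_sites
    return max(largest_site, throughput_bound)


# Hypothetical collection: one large site and two small ones.
seconds = min_cycle_seconds([1000, 200, 200], delay_seconds=60, max_sites=2)
print(seconds / 3600)  # lower bound on cycle length, in hours
```

In this example the large site dominates the estimate, which is one reason to place large or frequently updated sites in a separate collection or sub collection with its own settings.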
The behavior of the Web crawler at the end of the crawl cycle depends upon the refresh_mode and refresh_when_idle settings, as well as the current level of activity. If the refresh interval is sufficiently large to allow all Web sites to be completely crawled, and the refresh_when_idle setting is disabled, the crawler will remain idle until the refresh interval ends. If the setting is enabled, a new crawl cycle will start immediately. A new crawl cycle is started by adding the start URLs to the crawl work queue.
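The idle-versus-restart decision described above can be sketched as follows. This Python fragment is only a simplified model: the function name, the queue representation, and the returned labels are hypothetical, and the effect of refresh_mode when the interval expires while work remains is not modeled.

```python
from collections import deque


def next_action(work_queue: deque, start_urls: list[str],
                refresh_when_idle: bool, seconds_until_refresh: float) -> str:
    """Decide what happens once the crawler checks its work queue."""
    if work_queue:
        # Work remains in the current cycle, so keep crawling.
        return "crawling"
    if refresh_when_idle or seconds_until_refresh <= 0:
        # A new cycle starts by adding the start URLs to the work queue.
        work_queue.extend(start_urls)
        return "new_cycle"
    # All sites are crawled and refresh_when_idle is disabled: wait for the interval to end.
    return "idle"


queue = deque()
print(next_action(queue, ["http://example.com/"], refresh_when_idle=False,
                  seconds_until_refresh=3600))  # idle
print(next_action(queue, ["http://example.com/"], refresh_when_idle=True,
                  seconds_until_refresh=3600))  # new_cycle
```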