Crawling Web content with the FAST Search Web crawler

Applies to: FAST Search Server 2010

The FAST Search Web crawler gathers Web items (or Web pages) from Web servers on a network. This is typically a bounded institutional or corporate network, although the crawler can, in principle, crawl the entire Internet.

The FAST Search Web crawler works, in many ways, like a Web browser downloading content from Web servers. But unlike a Web browser, which responds only to user input through mouse clicks or the keyboard, the FAST Search Web crawler works from a set of configurable rules that it must follow when it requests Web items. These rules include, for example, how long to wait between requests for items and how long to wait before checking for new or updated items.

How the FAST Search Web crawler works

The FAST Search Web crawler starts by comparing the start URL list against the include and exclude rules specified in the XML file that contains the crawl configuration. The start URL list is specified with either the start_uris or start_uri_files setting, and the rules with the include_domains and exclude_domains settings. Valid URLs are then requested from their Web servers at the request rate configured in the delay setting.
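
To make the rule evaluation concrete, the following Python sketch shows how a start URL list might be filtered against include and exclude hostname rules. It is an illustration only: the hostnames are made up, the function is not part of the product, and the assumption that exclude rules take precedence over include rules is made for the sake of the example.

```python
from urllib.parse import urlparse

# Illustrative stand-ins for the include_domains / exclude_domains rules in the
# XML crawl configuration; the hostnames are hypothetical.
include_hosts = {"intranet.contoso.com", "hr.contoso.com"}
exclude_hosts = {"archive.contoso.com"}

def is_allowed(url):
    host = urlparse(url).hostname or ""
    if host in exclude_hosts:        # assumption: exclude rules win over include rules
        return False
    return host in include_hosts     # otherwise the host must match an include rule

start_uris = [
    "http://intranet.contoso.com/",
    "http://archive.contoso.com/old/",
    "http://www.example.org/",
]
print([url for url in start_uris if is_allowed(url)])
# ['http://intranet.contoso.com/']
```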

If a Web item is fetched successfully, it is parsed for hyperlinks and other meta-information, usually by an HTML parser built into the FAST Search Web crawler. The Web item’s meta-information is stored in the FAST Search Web crawler meta-database, and the Web item content (the HTML body) is stored in the FAST Search Web crawler store. The hyperlinks are filtered against the crawl rules and used as the next set of URLs to download. This process continues until all reachable content has been gathered, until the refresh interval (refresh setting) expires, or until another configured limit on the scope of the crawl is reached.
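
The following Python sketch outlines this fetch, parse, filter, and store cycle. It is a simplified illustration, not product code: the in-memory dictionaries stand in for the crawler meta-database and crawler store, the regular expression stands in for the built-in HTML parser, and the names and limits are hypothetical.

```python
import re
import time
import urllib.request
from urllib.parse import urljoin, urlparse

DELAY_SECONDS = 60.0                      # request rate (delay setting)
ALLOWED_HOSTS = {"intranet.contoso.com"}  # stand-in for the include/exclude rules

meta_database = {}   # URL -> meta-information (stand-in for the meta-database)
crawler_store = {}   # URL -> HTML body (stand-in for the crawler store)

def is_allowed(url):
    return (urlparse(url).hostname or "") in ALLOWED_HOSTS

def crawl(start_uris, max_items=100):
    queue = [url for url in start_uris if is_allowed(url)]
    seen = set(queue)
    while queue and len(crawler_store) < max_items:
        url = queue.pop(0)
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                body = response.read()
        except OSError:
            continue                      # fetch failed; skip it in this sketch
        crawler_store[url] = body         # store the Web item content
        meta_database[url] = {"size": len(body)}
        # Crude hyperlink extraction; the product uses a built-in HTML parser.
        for href in re.findall(rb'href="([^"]+)"', body):
            link = urljoin(url, href.decode("utf-8", "replace"))
            if is_allowed(link) and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(DELAY_SECONDS)         # wait between requests to the Web site

crawl(["http://intranet.contoso.com/"])
```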

There are many different ways to adjust the configuration to suit a specific Web crawling scenario. The following table lists some of the fundamental concepts that are used to set up and control the FAST Search Web crawler.

Concept Explanation

Crawl collection

A set of Web sites crawled with the same configuration is called a crawl collection. A Web crawler may crawl multiple crawl collections at the same time and submit their content either to a single content collection or to separate content collections.

Crawl store

The FAST Search Web crawler stores crawled content locally on disk during crawling. The content is divided into two types: Web item content and metadata.

Include rules

Include rules specify what Web content should be included. However, they do not define where the FAST Search Web crawler should start the crawl.

Exclude rules

Exclude rules specify which hostnames, URLs or URL patterns should not be included in the crawl.

Start URL list

The list of URLs to be crawled and collected first. Hyperlinks extracted from these Web items are tested against the crawl rules and added to work queues for further crawling.

Refresh interval

The time in minutes the FAST Search Web crawler will run before re-crawling the Web sites to check for new or modified content. The refresh interval should be set high enough to ensure the crawler has sufficient time to crawl all content. See the section Determine crawl schedules for information about calculating the refresh interval.

Request rate

The time in seconds between individual requests to a single Web site, configured with the delay setting. This option can be set to 0 to crawl as fast as possible, but doing so requires the Web server owner’s permission. For flexibility, different request rates can be specified for different times of day or days of the week with the variable delay setting.

Concurrent Web sites

The maximum number of Web sites that each node scheduler crawls at the same time. If there are more Web sites to be crawled than this number, the refresh interval must be increased accordingly.

Crawl speed

The rate at which Web items are collected from the Web sites of a given collection. The maximum rate is the number of concurrent Web sites divided by the request rate; see the worked example after this table.

Duplicate documents

A Web item may in some cases be referenced by multiple URLs. To avoid indexing the same Web item multiple times, a mechanism known as duplicate detection ensures that only one copy of each unique Web item is indexed.
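
As a back-of-the-envelope illustration of how the request rate, the number of concurrent Web sites, and the refresh interval interact, the following calculation uses made-up numbers; see the section Determine crawl schedules for the recommended way to plan a crawl schedule.

```python
# Hypothetical sizing figures; these are not product defaults.
concurrent_sites = 128       # maximum Web sites crawled at the same time
delay_seconds = 60.0         # request rate: seconds between requests to one Web site
largest_site_items = 10_000  # Web items on the largest single Web site
total_items = 500_000        # Web items across the whole crawl collection

# Maximum crawl speed: concurrent Web sites divided by the request rate.
max_items_per_second = concurrent_sites / delay_seconds              # about 2.1 items/sec

# One Web site is fetched at one item per delay_seconds, so the largest site
# sets a lower bound on the refresh interval (in minutes).
min_refresh_largest_site = largest_site_items * delay_seconds / 60   # 10,000 minutes

# The whole collection, crawled at maximum speed, gives a second lower bound.
min_refresh_collection = total_items / max_items_per_second / 60     # about 3,900 minutes

print(f"Maximum crawl speed: {max_items_per_second:.2f} items/second")
print(f"Minimum refresh interval: "
      f"{max(min_refresh_largest_site, min_refresh_collection):.0f} minutes")
```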

How to operate the FAST Search Web crawler

To start or stop the FAST Search Web crawler, use the Node Controller, which is accessed through the command-line tool nctrl.exe. Internally, the Web crawler is organized as a collection of processes and logical components that typically run on a single server. It is also possible to distribute the Web crawler across multiple servers, allowing the FAST Search Web crawler to gather and process a larger number of Web items from a large number of Web sites. The following table lists the components and their associated processes:

Component Process Function

Node Scheduler

crawler.exe

Schedules Web crawling on a single farm server.

Multi-node Scheduler

crawler.exe

Schedules Web crawling across farm servers.

Site Manager

crawler.exe

Performs Web crawling, managed by the Node Scheduler.

Post Process

postprocess.exe

Performs duplicate detection and submits content. Managed by the Node Scheduler, but can also be used separately to resubmit all content to the indexer.

File Server

crawlerfs.exe

Allows item processing to retrieve Web pages from the FAST Search Web crawler. Managed by the Node Scheduler and Post Process.

Duplicate Server

ppdup.exe

Performs duplicate detection across farm servers.

Browser Engine

browserengine.exe

Handles content and hyperlink extraction from Web items. Only used when JavaScript support is turned on.

When the FAST Search Web crawler is deployed on a single server, the primary process is known as the Node Scheduler. It has several tasks, including resolving host names to IP addresses, maintaining the crawl configurations, and performing other global jobs. It is also responsible for routing Web sites to one of the Site Manager processes. The Node Scheduler is started (or stopped) by the Node Controller and is in turn responsible for starting and stopping the other Web crawler processes.

The Site Manager manages work queues per Web site and is responsible for fetching pages, computing the checksum of the Web item content, storing the Web items to disk, and associated activities such as Web site authentication if required.

Post Process maintains a database of Web item checksums in order to detect duplicates, and it is responsible for submitting Web items for indexing. Small Web items are sent directly to the item processing pipeline, but larger Web items are sent with only a URL reference. The File Server process is responsible for supplying the Web item contents to any pipeline stage that requests them.
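
The following Python sketch shows the general idea of checksum-based duplicate detection. It is an illustration only: the checksum algorithm, the in-memory table, and the function name are assumptions made for the example, not the internal format used by Post Process or the Duplicate Server.

```python
import hashlib

seen_checksums = {}   # checksum -> first URL seen with this content (illustrative)

def should_submit(url, content):
    """Return True if the Web item should be submitted for indexing."""
    checksum = hashlib.sha1(content).hexdigest()   # assumed checksum algorithm
    first_url = seen_checksums.setdefault(checksum, url)
    # If the same content was already seen under another URL, treat it as a duplicate.
    return first_url == url

print(should_submit("http://intranet.contoso.com/a", b"<html>same body</html>"))  # True
print(should_submit("http://intranet.contoso.com/b", b"<html>same body</html>"))  # False
```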

If the number of Web sites or the total number of Web items to be crawled is large, the FAST Search Web crawler can be scaled by distributing it across multiple servers. In this deployment scenario, additional processes are started. A Multi-node Scheduler is added, which performs hostname-to-IP resolution, provides centralized configuration and logging, and routes URLs to the appropriate Node Scheduler. Each Node Scheduler still runs a local Post Process, but each Post Process must now submit Web item checksums to the Duplicate Servers, which maintain a global database of URLs and content checksums.
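
One simple way to picture how URLs can be routed to Node Schedulers is to partition them by hostname, so that all Web items from one Web site are handled by the same node. The Python sketch below is purely illustrative: the server names are hypothetical and the hashing scheme is an assumption, not a description of the Multi-node Scheduler’s actual routing policy.

```python
import hashlib
from urllib.parse import urlparse

node_schedulers = ["fastnode1", "fastnode2", "fastnode3"]   # hypothetical server names

def route(url):
    """Pick a node scheduler for a URL based on a hash of its hostname."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return node_schedulers[int.from_bytes(digest[:4], "big") % len(node_schedulers)]

for url in ("http://intranet.contoso.com/page1.html",
            "http://intranet.contoso.com/page2.html",
            "http://hr.contoso.com/benefits.html"):
    print(url, "->", route(url))
# Both intranet.contoso.com pages map to the same node scheduler.
```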

In this section: