Determine what Web content to crawl
Published: May 12, 2010
When building a search index, it is critical to exclude duplicate or otherwise less useful content. For example, you will probably want to exclude the empty pages of an online calendar system. Keep in mind what content you want to exclude when configuring the FAST Search Web crawler.
Determine the start URLs
The start URL list provides the initial set of URLs (to Web sites or individual Web items) for the Web crawler to retrieve. As each URL is fetched, the Web crawler parses the Web item to locate additional hyperlinks, both to the same Web site and to other Web sites.
If the start URL list contains URLs to more Web sites than the Web crawler is allowed to crawl at the same time (the max_sites setting), some Web sites remain queued until another Web site finishes crawling and frees a slot for a new Web site. To ensure that all Web sites are crawled within the refresh interval, use the max_inter_docs setting to force a different Web site to be scheduled after the specified number of Web items has been downloaded from each Web site.
Enabling max_inter_docs can be expensive with regard to the queue structure, and it raises the possibility of overflowing file system limits. Thoroughly consider the implications for Web-scale crawls before enabling the max_inter_docs feature.
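The interaction between max_sites and max_inter_docs can be illustrated with a small simulation. This is only a sketch of the scheduling idea described above, not the crawler's actual implementation: the function name simulate, the round-based model, and the site names are assumptions made for illustration, and the real crawler works concurrently rather than in discrete rounds.

```python
from collections import deque


def simulate(sites, max_sites, max_inter_docs=None):
    """Return the order in which sites get a crawl slot.

    sites          -- dict mapping a (hypothetical) site name to the number
                      of Web items it holds
    max_sites      -- how many sites may be crawled at the same time
    max_inter_docs -- if set, a site gives up its slot after this many items

    Each entry in the result is (site, items_downloaded_in_that_turn).
    """
    queue = deque(sites.items())
    turns = []
    while queue:
        # Hand out at most max_sites slots for this round.
        active = [queue.popleft() for _ in range(min(max_sites, len(queue)))]
        for name, remaining in active:
            if max_inter_docs is None:
                batch = remaining  # the site keeps its slot until it is done
            else:
                batch = min(remaining, max_inter_docs)
            turns.append((name, batch))
            if remaining - batch > 0:
                # Unfinished site goes to the back of the queue.
                queue.append((name, remaining - batch))
    return turns


sites = {"a": 5, "b": 2, "c": 1}
print(simulate(sites, max_sites=2))
print(simulate(sites, max_sites=2, max_inter_docs=2))
```

Without max_inter_docs, site "c" waits until a slot is freed by a completed site. With max_inter_docs=2, site "a" is rescheduled several times, which shows where the extra queueing cost comes from.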
Determine include and exclude rules
The first factor to consider is which Web sites should be crawled. If no rules restrict it, the FAST Search Web crawler considers any URL to be valid. In most scenarios, this results in too much data.
Generally, an index is being built for a limited number of known Web sites, identified by their host names. For these Web sites, one or more start URLs are specified, giving the Web crawler a starting point within the Web site. An include rule corresponding to the start URL can be specific (for example, an exact match for www.contoso.org), or it can be more general and match all Web sites in a given DNS domain (for example, any host name matching the suffix .contoso.com).
However, there are often exceptions to the general include rules. For example, a particular host name within a DNS domain should not be crawled, or particular parts of a Web site should be excluded. Enter these exceptions in the exclude filters of the configuration file, where you can further specify domains or URLs to be excluded from the crawl.
As the crawler fetches Web items and parses them for new URLs to crawl, it will evaluate each candidate URL against its configured rules. If a URL matches either an include host name or include URL filter rule, while not matching an exclude host name or exclude URL filter rule, then it is considered eligible for further processing, and possibly fetching.
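The evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler's rule engine: the Rules class, its three rule kinds (exact host name, host name suffix, URL prefix), and the is_eligible function are hypothetical names, and the actual configuration format is not shown here. Treating an empty include set as "any URL is valid" follows the behavior described earlier for a crawler with no restricting rules.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Rules:
    """A simplified, hypothetical rule set."""
    hosts: tuple = ()          # exact host names, e.g. "www.contoso.org"
    host_suffixes: tuple = ()  # host name suffixes, e.g. ".contoso.com"
    url_prefixes: tuple = ()   # URL prefixes, e.g. "http://www.contoso.org/calendar/"

    def is_empty(self):
        return not (self.hosts or self.host_suffixes or self.url_prefixes)

    def matches(self, url):
        host = (urlsplit(url).hostname or "").lower()
        return (
            host in self.hosts
            or any(host.endswith(suffix) for suffix in self.host_suffixes)
            or any(url.startswith(prefix) for prefix in self.url_prefixes)
        )


def is_eligible(url, include, exclude):
    """A URL is eligible if it matches an include rule and no exclude rule."""
    included = include.is_empty() or include.matches(url)
    return included and not exclude.matches(url)


include = Rules(hosts=("www.contoso.org",), host_suffixes=(".contoso.com",))
exclude = Rules(
    hosts=("test.contoso.com",),
    url_prefixes=("http://www.contoso.org/calendar/",),
)

for candidate in (
    "http://www.contoso.org/news",
    "http://test.contoso.com/",
    "http://www.contoso.org/calendar/2010-05",
):
    print(candidate, is_eligible(candidate, include, exclude))
```

In this sketch an exclude match always wins over an include match, which mirrors the "while not matching an exclude rule" condition in the paragraph above.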
Within a given Web site, you can configure the crawler to gather all pages or to crawl only to a limited depth by using the crawl_mode setting. Use the max_doc setting to set an overall limit on the number of Web items downloaded from a specific Web site per refresh interval. This can be helpful when a large number of Web sites are being crawled and a "deep" Web site would otherwise account for a large share of the Web items fetched, limiting the available resources and starving other Web sites. Use the cut_off setting to specify a size limit per Web item. Then, use the truncate setting to specify what happens to Web items that exceed this threshold.
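The effect of the per-site and per-item limits can be sketched as below. This is an illustration only: the function crawl_site and its arguments are hypothetical, and the sketch assumes that truncate chooses between keeping the first cut_off bytes of an oversized Web item and discarding the item, and that every downloaded item counts toward max_doc. Check the configuration reference for the exact semantics.

```python
def crawl_site(items, max_doc, cut_off, truncate):
    """Apply max_doc, cut_off, and truncate to one site's Web items.

    items    -- iterable of item bodies (bytes), in download order
    max_doc  -- maximum number of items to download from this site
    cut_off  -- maximum size in bytes of a stored item
    truncate -- True: keep the first cut_off bytes of an oversized item
                False: discard oversized items (assumed behavior)
    """
    stored = []
    for downloaded, body in enumerate(items):
        if downloaded >= max_doc:
            break  # per-site limit reached for this refresh interval
        if len(body) <= cut_off:
            stored.append(body)
        elif truncate:
            stored.append(body[:cut_off])
        # otherwise the oversized item is dropped
    return stored


items = [b"a" * 10, b"b" * 3, b"c" * 4]
print(crawl_site(items, max_doc=2, cut_off=5, truncate=True))
print(crawl_site(items, max_doc=2, cut_off=5, truncate=False))
```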