Configure the FAST Search Web crawler
Published: May 12, 2010
The FAST Search Web crawler is highly customizable and is able to scale to large environments, for example when your organization is crawling a large number of external websites.
You install the FAST Search Web crawler by installing FAST Search Server 2010 for SharePoint. Post installation, you need to configure the FAST Search Web crawler for use.
Configuration file templates
You configure the FAST Search Web crawler by editing a copy of an XML configuration file. You operate the FAST Search Web crawler through a number of command-line tools.
In the <FASTSearchFolder>\etc folder, where <FASTSearchFolder> is the path of the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch, you will find three configuration templates:
The simple template contains a minimum set of configuration parameters to be entered. This template is the starting point to set up a FAST Search Web crawler.
The advanced template describes most options and contains an extensive number of comments. This file can be used as a reference to enable you to set up complex FAST Search Web crawler environments.
The RSS template contains the configuration options that are needed to set up a crawl of RSS content.
Before you set up the first FAST Search Web crawler configuration, copy the file CrawlerCollectionDefaults.xml.generic.template (located in <FASTSearchFolder>\META\config\profiles\default\templates\installer\) to the <FASTSearchFolder>\etc folder. Then rename this file to CrawlerCollectionDefaults.xml.
In case you have a multi-server deployment, copy the CrawlerCollectionDefaults.xml file to the <FASTSearchFolder>\etc folder of each server in the farm that hosts a FAST Search Web crawler component.
If FAST Search Server 2010 for SharePoint is already running, you must restart the FAST Search Web crawler process for the changes to take effect.
To set up a basic FAST Search Web crawler configuration, copy the file CrawlerConfigTemplate-Simple.xml that you can find in the <FASTSearchFolder>\etc folder. Give the copied file a new unique name and configure it for your environment. In the same folder, you will find the files CrawlerCollectionDefaults.xml and CrawlerGlobalDefaults.xml. These files contain several default settings for all collections, and do not have to be edited.
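As an illustration, a minimal edited copy of the simple template might look like the following sketch. The collection name and start URL are placeholders, and the exact element names and attributes should be verified against the template file and the Web Crawler XML configuration reference:

```xml
<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
  <!-- "mysites" is a placeholder collection name -->
  <DomainSpecification name="mysites">
    <!-- URLs where the FAST Search Web crawler starts the crawl -->
    <attrib name="start_uris" type="list-string">
      <member> http://www.example.com/ </member>
    </attrib>
  </DomainSpecification>
</CrawlerConfig>
```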
If you decide not to use certain sections in the configuration file, consider keeping these sections in your configuration file, but with empty values, instead of removing them from the XML. This will make implementing partial configurations at a later stage easier, as you can keep the remaining configuration intact.
The minimum FAST Search Web crawler configuration should specify where the FAST Search Web crawler starts the crawl, what to crawl, how fast, and for how long. Specify start URLs first, and then set up additional include and exclude rules to limit the number of Web sites being crawled. If you do not set up additional rules, any Web site may be crawled, which is likely to overload the system and crawl unwanted content.
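As a sketch, an include rule that limits the crawl to a single domain could look like the following. The domain is a placeholder, and the section and attribute names follow the pattern used in the configuration templates; verify them against the Web Crawler XML configuration reference:

```xml
<!-- Inside the DomainSpecification element: only crawl host names
     that end in example.com; other Web sites are excluded -->
<section name="include_domains">
  <attrib name="suffix" type="list-string">
    <member> example.com </member>
  </attrib>
</section>
```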
Then, specify how long the FAST Search Web crawler should wait after each download request to a Web site before issuing the next request.
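In the configuration file, this request rate is expressed in seconds. A sketch, with an illustrative value (check the attribute name against the templates):

```xml
<!-- Wait 60 seconds between download requests to the same Web site -->
<attrib name="delay" type="real"> 60 </attrib>
```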
The refresh interval depends on the type and amount of content on the Web sites that you are crawling. You can organize highly dynamic Web sites in a separate collection and configure this collection to have a short refresh interval, for example.
To calculate the necessary refresh interval, multiply the expected number of items on the largest Web site by the request rate (in seconds between retrievals), and divide by 60 to convert to minutes. For example, a Web site that contains 1000 items and is crawled with a request rate of 30 (that is, 30 seconds between retrievals) requires a minimum refresh interval of (1000*30)/60 = 500 minutes.
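Following the example above, the refresh interval might be configured as in this sketch (attribute name as used in the templates; verify against the Web Crawler XML configuration reference):

```xml
<!-- Revisit the collection every 500 minutes: enough for a Web site
     with 1000 items crawled at one request every 30 seconds -->
<attrib name="refresh" type="real"> 500 </attrib>
```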
Finally, determine how to access the content. For example, you may have to specify an HTTP proxy for the FAST Search Web crawler by using the proxy setting. You may also have to configure any authentication schemes and credentials that are needed by using the logins setting. Refer to the article Web Crawler XML configuration reference for more information about advanced configuration settings.
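As an illustrative sketch only: the proxy host below is a placeholder, and the exact shape of the proxy and logins settings should be taken from the advanced template or the Web Crawler XML configuration reference:

```xml
<!-- Route crawler requests through an HTTP proxy (placeholder host) -->
<attrib name="proxy" type="list-string">
  <member> http://proxy.example.com:3128 </member>
</attrib>
```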