Customizing the Browser Engine

 

Applies to: FAST Search Server 2010

How the Browser Engine works

The FAST Search Web crawler uses the Browser Engine component to extract content and hyperlinks from Web items that contain embedded JavaScript. The use_javascript option in the crawl configuration controls whether this component is used.

When JavaScript support is turned on, the FAST Search Web crawler examines each HTML Web item to see whether it contains one or more <script> tags. If it does, the Web item is sent to the Browser Engine to be parsed, bypassing the normal HTML hyperlink extraction process.
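The routing decision described above can be illustrated with a minimal sketch. It is not the crawler's actual implementation: the function needs_browser_engine and the class ScriptTagDetector are hypothetical names, and the sketch only assumes that the check amounts to "use_javascript is on and the HTML contains at least one <script> tag".

```python
from html.parser import HTMLParser


class ScriptTagDetector(HTMLParser):
    """Records whether the parsed HTML contains at least one <script> tag."""

    def __init__(self):
        super().__init__()
        self.has_script = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "script":
            self.has_script = True


def needs_browser_engine(html_text, use_javascript):
    """Hypothetical helper: True if the item would go to the Browser Engine."""
    if not use_javascript:
        # JavaScript support is off: normal hyperlink extraction applies.
        return False
    detector = ScriptTagDetector()
    detector.feed(html_text)
    detector.close()
    return detector.has_script
```

For example, needs_browser_engine("<p>plain page</p>", True) returns False, so such an item would stay on the normal HTML hyperlink extraction path.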

The Browser Engine functions like an automated Web browser. It loads the Web item, including all referenced cascading style sheets (CSS) and external JavaScript URLs, and creates a document object model (DOM) from the Web item. As part of this process, all JavaScript code that normally runs when the Web item loads is run, and any generated hyperlinks (for example, from navigation events) are aggregated. Once fully loaded, the Web item is sent through an internal mini pipeline that runs a set of small extractors responsible for:

  • Simulating user clicks on hyperlinks

  • Filling out simple forms (drop-down lists, check boxes, radio buttons, and so on)

  • Aggregating cookies

  • Generating a Web item checksum and new HTML (for later processing in the FAST Search item processing stages)

All hyperlinks that are encountered while running the extractors mentioned above are aggregated as well, and the new HTML, checksum and all hyperlinks are sent back to the FAST Search Web crawler.
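The aggregation step described above can be sketched as a loop over extractor functions. This is only an illustration under stated assumptions: LoadedItem, checksum_extractor, and run_pipeline are hypothetical names, each extractor is assumed to return the hyperlinks it discovers, and SHA-256 is an arbitrary stand-in because the actual checksum algorithm is not described here.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class LoadedItem:
    """Hypothetical stand-in for a Web item after it has been fully loaded."""
    html: str
    links: list = field(default_factory=list)  # hyperlinks aggregated so far
    checksum: str = ""


def checksum_extractor(item):
    # Assumption: SHA-256 over the HTML; the real algorithm is not documented here.
    item.checksum = hashlib.sha256(item.html.encode("utf-8")).hexdigest()
    return []  # this extractor discovers no hyperlinks


def run_pipeline(item, extractors):
    """Run each extractor in order and aggregate any new hyperlinks it returns."""
    for extractor in extractors:
        for link in extractor(item):
            if link not in item.links:
                item.links.append(link)
    # What is handed back to the crawler: new HTML, checksum, all hyperlinks.
    return item.html, item.checksum, item.links
```

An extractor that simulates clicks or fills out forms would plug into the same list as a function that takes the item and returns the hyperlinks it found.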

Unlike the standard FAST Search Web crawler hyperlink extractor, which only looks at the HTML body of a Web item, the Browser Engine also requires the external URLs referenced by the Web item. Because this increases the load on the Web servers being crawled, you can adjust the rate at which these URLs are downloaded separately from the normal request rate by using the javascript_delay option. The default is set, quite aggressively, to 0, which means there is no delay (as in a Web browser).
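The effect of a delay between downloads of external URLs can be sketched as a small throttle. This is an illustration only, not the crawler's code: DependencyThrottle is a hypothetical class, and the delay is assumed to be expressed in seconds. The clock and sleep functions are injectable so that the behavior can be checked without real waiting.

```python
import time


class DependencyThrottle:
    """Spaces out downloads of external URLs by a fixed delay (assumed seconds)."""

    def __init__(self, javascript_delay=0.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = javascript_delay
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next download may start

    def wait(self):
        """Block until the next download is allowed; return the time waited."""
        now = self._clock()
        waited = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            waited = self._next_allowed - now
            self._sleep(waited)
            now = self._next_allowed
        self._next_allowed = now + self.delay
        return waited
```

With a delay of 0, wait() never sleeps, which corresponds to the default behavior described above. A larger value trades crawl speed for a lower load on the crawled Web servers.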

Because Browser Engine processing is quite CPU intensive, crawling too fast can overload the Browser Engine. If the Browser Engine has no available capacity when it receives a processing request, it returns an “overload message” to the Web crawler. The Web crawler then attempts to send the request to another Browser Engine, if one is available; otherwise, it delays processing until the Browser Engine is again ready to process additional Web items.
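The fallback behavior described above can be sketched as follows. The names are hypothetical: Overloaded stands in for the overload message, and dispatch and its process callback do not correspond to any documented crawler interface.

```python
class Overloaded(Exception):
    """Hypothetical signal standing in for the overload message."""


def dispatch(item, engines, process):
    """Try each Browser Engine in turn.

    Returns (engine, result) from the first engine that accepts the item,
    or None if every engine reported that it is overloaded.
    """
    for engine in engines:
        try:
            return engine, process(engine, item)
        except Overloaded:
            continue  # this engine has no capacity; try the next one
    return None  # caller is expected to delay and retry later
```

A None result corresponds to the case where no Browser Engine has capacity, so the caller would wait and submit the Web item again later.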

Customizing the Browser Engine

The Browser Engine is configured by modifying the BEConfig.xml configuration file, located in the etc folder of the FAST Search Server 2010 for SharePoint installation directory on each farm server where a Browser Engine is deployed. Tuning the Browser Engine is mainly a matter of adjusting the set of timeouts and cache sizes to match how the FAST Search Web crawler is used.

After modifying the BEConfig.xml file, the Browser Engine must be restarted.

Restart the Browser Engine

  1. Verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.

  2. On the Start menu, click All Programs.

  3. Click FAST Search Server 2010 for SharePoint.

  4. Click the FAST Search Server 2010 for SharePoint Management Shell.

  5. At the Windows PowerShell command prompt, type the following command:

    nctrl restart browserengine