Customizing the Browser Engine

07/22/2014

Applies to: FAST Search Server 2010

How the Browser Engine works

The FAST Search Web crawler uses the Browser Engine component to extract content and hyperlinks from Web items that contain embedded JavaScript. Whether this component is used is controlled by the use_javascript option in the crawl configuration.

When JavaScript support is turned on, the FAST Search Web crawler will examine each HTML Web item to see if it contains one or more <script> tags. If there are any, the Web item is sent to the Browser Engine to be parsed, bypassing the normal HTML hyperlink extraction process.
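The routing decision described above can be sketched in a few lines of Python. This is only an illustration, not the crawler's actual implementation: the function name choose_extractor, the class ScriptTagDetector, and the two returned labels are hypothetical, and only the use_javascript option name comes from the crawl configuration.

```python
from html.parser import HTMLParser


class ScriptTagDetector(HTMLParser):
    """Records whether any <script> start tag appears in the markup."""

    def __init__(self):
        super().__init__()
        self.found_script = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <SCRIPT> is matched as well.
        if tag == "script":
            self.found_script = True


def choose_extractor(html, use_javascript):
    """Return a hypothetical label naming the component that handles the item."""
    if not use_javascript:
        # JavaScript support is off: always use normal hyperlink extraction.
        return "html_link_extractor"
    detector = ScriptTagDetector()
    detector.feed(html)
    # Any <script> tag sends the item to the Browser Engine instead.
    return "browser_engine" if detector.found_script else "html_link_extractor"
```

In this sketch, a Web item without any <script> tag still goes through normal hyperlink extraction even when use_javascript is turned on.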

The Browser Engine functions like an automated Web browser. It loads the Web item, including all referenced cascading style sheets (CSS) and external JavaScript files, and creates a document object model (DOM) from the Web item. As part of this process, all JavaScript code that normally runs when the Web item loads is run, and any generated hyperlinks (for example, from navigation events) are aggregated. Once the Web item is fully loaded, it is sent through an internal mini pipeline that runs a set of small extractors responsible for:

Simulating user clicks on hyperlinks
Filling out simple forms (dropdowns, checkboxes, radio buttons, and so on)
Aggregating cookies
Generating a Web item checksum and new HTML (for later processing in the FAST Search item processing stages)

All hyperlinks that are encountered while running the extractors mentioned above are aggregated as well, and the new HTML, checksum and all hyperlinks are sent back to the FAST Search Web crawler.
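The aggregation step can be pictured with a small Python sketch. Everything here is assumed for illustration: the page dictionary layout, the extractor functions, and the use of SHA-256 for the checksum are hypothetical stand-ins, because this article does not describe the Browser Engine's internal data structures or checksum algorithm.

```python
import hashlib


def click_extractor(page):
    # Stand-in for simulated user clicks: returns the hyperlinks they would reveal.
    return set(page.get("click_targets", []))


def form_extractor(page):
    # Stand-in for simple form filling: returns the hyperlinks it would reveal.
    return set(page.get("form_targets", []))


def run_pipeline(page, extractors):
    """Aggregate hyperlinks from page load and from each extractor."""
    # Start with the hyperlinks generated while the Web item was loading.
    links = set(page.get("load_links", []))
    for extractor in extractors:
        links |= extractor(page)
    html = page["html"]
    # SHA-256 is an assumption; the real checksum algorithm is not documented here.
    checksum = hashlib.sha256(html.encode("utf-8")).hexdigest()
    # This mirrors what is sent back to the crawler: new HTML, checksum, hyperlinks.
    return {"html": html, "checksum": checksum, "links": sorted(links)}
```

Using a set means that a hyperlink found by more than one extractor is reported only once.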

Unlike the standard FAST Search Web crawler hyperlink extractor, which only looks at the HTML body of a Web item, the Browser Engine also needs the external URLs that the Web item references. Because this increases the load on the Web servers being crawled, the rate at which these URLs are downloaded can be adjusted separately from the normal request rate. This is done through the javascript_delay option. The default is 0, a fairly aggressive setting that means there is no delay, as in a Web browser.
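To make the effect of javascript_delay concrete, the following Python sketch computes the earliest start time for each external download under one simple assumption: the delay is a fixed number of seconds between consecutive downloads. The function name resource_fetch_offsets is hypothetical, and the crawler's real scheduling may differ.

```python
def resource_fetch_offsets(resource_urls, javascript_delay=0.0):
    """Return (url, earliest_start_seconds) pairs for the external resources.

    Assumes a fixed delay between consecutive downloads; this is an
    illustration and not the crawler's documented scheduling algorithm.
    """
    if javascript_delay < 0:
        raise ValueError("javascript_delay must not be negative")
    return [(url, index * javascript_delay) for index, url in enumerate(resource_urls)]
```

With the default of 0, every resource can be requested immediately. With a delay of 2 seconds, the third resource would not be requested until 4 seconds after the first.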

Because Browser Engine processing is quite CPU intensive, it is possible to overload the Browser Engine by crawling too fast. If the Browser Engine has no available capacity when it receives a processing request, it returns an “overload message” to the Web crawler. The Web crawler then attempts to send the request to another Browser Engine, if one is available; otherwise, it delays processing until the Browser Engine is again ready to process additional Web items.
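The overload handling described above amounts to a failover loop. The following Python sketch shows the idea with in-memory stand-ins: the Overloaded exception, the FakeBrowserEngine class, and the dispatch function are all hypothetical names, and the real crawler exchanges messages with separate Browser Engine processes rather than calling methods.

```python
class Overloaded(Exception):
    """Stand-in for the overload message returned by a busy Browser Engine."""


class FakeBrowserEngine:
    """Toy engine that accepts a fixed number of requests (hypothetical)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity

    def process(self, item):
        if self.capacity <= 0:
            raise Overloaded(self.name)
        self.capacity -= 1
        return f"{self.name} processed {item}"


def dispatch(item, engines):
    """Try each engine in turn and return the first successful result.

    Returns None when every engine is overloaded, which signals the caller
    to delay the item and retry later.
    """
    for engine in engines:
        try:
            return engine.process(item)
        except Overloaded:
            continue  # This engine is busy; try the next one.
    return None
```

A caller that receives None would keep the Web item queued and call dispatch again after a pause. The toy engine never frees capacity, so it only demonstrates the failover path.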