crawlerglobaldefaults.xml reference

 

Applies to: FAST Search Server 2010

Use crawlerglobaldefaults.xml to specify FAST Search Web crawler configuration options that apply to all crawl collections. Configuration options include DNS, content submission, duplicate detection, and other global settings. This is an advanced feature. You will rarely have to use it.

Warning

Any changes that you make to this file will be overwritten and lost if you:

  • Run the Set-FASTSearchConfiguration Windows PowerShell cmdlet.

  • Install a FAST Search Server 2010 for SharePoint update or service pack.

To avoid losing your changes, make sure that you back up this file after you have modified it.
Remember to reapply your changes after you run the Set-FASTSearchConfiguration Windows PowerShell cmdlet or install a FAST Search Server 2010 for SharePoint update or service pack.

On startup, the FAST Search Web crawler looks for the crawlerglobaldefaults.xml file in <FASTSearchFolder>\etc\ (where <FASTSearchFolder> is the path of the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch). You can override this location by passing the -F <path> argument to the crawler.exe executable in NodeConf.xml. After you edit NodeConf.xml, restart nctrl.exe or run nctrl.exe reloadcfg.

If a crawlerglobaldefaults.xml file cannot be found, the FAST Search Web crawler reverts to defaults for the settings that can be specified in this file. Some settings can be overridden on the crawler.exe command line. For more information, see crawler.exe reference.

Customizing crawlerglobaldefaults.xml

Note

To modify a configuration file, verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.

To edit this file:

  1. Edit crawlerglobaldefaults.xml in a text editor, not a general-purpose XML editor. Use the existing file in <FASTSearchFolder>\etc\ as a starting point. Include the elements and settings that you must have.

  2. Run nctrl.exe restart crawler to restart the FAST Search Web crawler process with the options that you set in step 1.

If the FAST Search Web crawler is running as a multi-node crawler, this file must be edited on each server where a crawler is running. Each crawler must also be restarted, by running nctrl.exe restart multinodescheduler on the node running the multi-node scheduler and nctrl.exe restart nodescheduler on the servers that are running the node schedulers.

crawlerglobaldefaults.xml quick reference

This table lists the elements in crawlerglobaldefaults.xml. The elements can appear in any order, except for CrawlerConfig, which is the root element; GlobalConfig, which must contain all section and attrib elements; and member, which can only occur inside an attrib element.

Element Description

CrawlerConfig

This root element identifies the file as a FAST Search Web crawler configuration file.

GlobalConfig

This element identifies the file as a global configuration settings file for the FAST Search Web crawler.

attrib

This child element specifies a configuration setting, specified either by its value or a set of member elements. Formatted as:

<attrib name="name" type="string|integer|real|boolean"> value </attrib>

member

This child element can only occur in an attrib element. It specifies a configuration setting in a list, and is formatted as:

<attrib name="name" type="list-string">   
  <member> first value </member>
  ..
  <member> last value </member>
</attrib>

section

This child element contains multiple settings grouped by type.

This table lists the options in crawlerglobaldefaults.xml.

Option Description

GlobalConfig options

These options are valid inside the GlobalConfig element.

feeding options

These options are valid inside a section element that has the name "feeding". They configure characteristics of submitting Web items to content indexing.

dns options

These attributes specify settings related to the crawler's internal DNS resolver.

near_duplicate_detection options

These options configure the near duplicate detection algorithm for collections that have it enabled.

timeouts options

These options specify global crawler time-out settings.

crawlerglobaldefaults.xml file format

Each XML element in crawlerglobaldefaults.xml begins with an opening tag, such as <attrib ...>, and ends with the matching closing tag, such as </attrib>.

The basic element format is as follows:

<attrib name="value" type="value"> value </attrib>

For example:

<attrib name="sitemanager_numsites" type="integer" > 1024 </attrib>

Elements, section names, attributes, and attribute values are case-sensitive. The values of the name and type attributes must be enclosed in quotation marks (" ").

An element definition can span multiple lines. Spaces, carriage returns, line feeds, and tab characters are ignored in an element definition.

For example:

<attrib
    name="sitemanager_numsites"
    type="integer"
> 1024 </attrib
>

Tip

For long parameter definitions, position values on separate lines and use indentation to make the file easier to read.

The <GlobalConfig> element is a special case and is required. All other elements, except the root <CrawlerConfig> element, are contained within the <GlobalConfig> element, which is closed with </GlobalConfig>.

The basic structure of the crawlerglobaldefaults.xml file is as follows:

<?xml version="1.0"?>
<CrawlerConfig>
    <GlobalConfig>
        ...
    </GlobalConfig>
</CrawlerConfig>

You can add comments anywhere, delimited by <!-- and -->.
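As a minimal sketch of how these pieces fit together, the following file combines the required structure with a comment, one single-value option, and one section. The option values are illustrative only and are not recommendations. The sketch assumes that options omitted from the file keep their defaults.

```xml
<?xml version="1.0"?>
<CrawlerConfig>
    <GlobalConfig>
        <!-- Illustrative value: keep rotated log files for 90 days -->
        <attrib name="logfile_ttl" type="integer"> 90 </attrib>

        <!-- Related options are grouped in a named section -->
        <section name="dns">
            <attrib name="timeout" type="integer"> 30 </attrib>
        </section>
    </GlobalConfig>
</CrawlerConfig>
```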

CrawlerConfig

This is the top-level element. It has no attributes.

GlobalConfig

This element contains the global crawler configuration. It has no attributes.

attrib

This child element specifies a configuration option, either a single value or a list using the member element.

Attributes

Attribute Value Description

name

option name

Specifies the option to configure. See the valid options in the option sections later in this topic.

type

string|integer|real|boolean|list-string

Specifies the type for the option value:

  • string - specifies a string type for the option value.

  • integer - specifies an integer type for the option value. The integer range is 0-2^31 unless otherwise stated.

  • real - specifies a real type for the option value. The real range is 0-2^63 unless otherwise stated.

  • boolean - specifies a Boolean type for the option value. Valid Boolean values are "yes" and "no".

  • list-string - specifies that the option value is a list of values, specified by one or more member elements.

The value of the type attribute must match the type associated with the option that is specified for the name attribute. For example, the numprocs option must always be used with the integer type.

Example

The following example specifies the value 2 for the numprocs option:

<attrib name="numprocs" type="integer"> 2 </attrib>

member

This specifies an element in a list of option values. It has no attributes.

The member element can only be used inside an attrib element.

Example

The following example specifies two browser engines for the browser_engines option:

<attrib name="browser_engines" type="list-string">
    <member> hostname1:13045 </member>
    <member> hostname2:13045 </member>
</attrib>

section

This child element groups a set of related options. A section element contains attrib elements.

Attributes

Attribute Value Description

name

name

Specifies the name of the section. Supported sections are listed in the options tables later in this topic.

Example

The following example configures the DNS options, specifying only the timeout option:

<section name="dns">
    <attrib name="timeout" type="integer"> 30 </attrib>
</section>

GlobalConfig options

These options are valid inside the GlobalConfig element.

Option Type Value Description

browser_engines

list-string

hostname:port

List of browser engines. The crawler uses these to process Web pages that contain JavaScript.

Default: Configured automatically by the installer

datadir

string

directory

The location of the crawler content store. Overridden by the -d option to crawler.exe.

dbtrace

boolean

yes|no

Enable/disable database operation tracing. For debugging only.

Default: no

directio

boolean

yes|no

Enable/disable direct I/O in postprocess and duplicate server. For debugging only.

Default: no

disk_resume_threshold

real

1-2^63

Threshold (in bytes) at which the crawler resumes crawling of all collections, if they have already been suspended by disk_suspend_threshold.

Default: 629145600

disk_suspend_threshold

real

1-2^63

Threshold (in bytes) when the crawler suspends crawling of all collections.

Default: 524288000

dns_resolver_threads

integer

1-64

Maximum number of DNS threads. Increasing this value may improve DNS resolve performance if you are crawling a large number of hostnames.

Default: 5

dns_use_platform_api

boolean

yes|no

Specifies whether to use the OS gethostbyname API for resolving DNS names and NetBIOS names, or the internal resolver.

The internal DNS resolver offers better performance and scalability, but does not support NetBIOS names.

Default: yes

duplicate_servers

list-string

hostname:port

List of duplicate servers.

Default: Configured automatically by the installer

logdir

string

directory

The location of the crawler log. Overridden by the -L option to crawler.exe.

logfile_ttl

integer

1-2^31

How long (in days) to keep rotated log files before deleting them.

Default: 365

numprocs

integer

1-8

Number of site manager processes to start.

Default: 2

ppdup_dbformat

string

hashlog|diskhashlog|gigabase

Database format that is used by the duplicate server in a multi-node FAST Search Web crawler deployment.

Default: hashlog

rc_update_freq

integer

1-3600

Specifies the update frequency of crawl statistics (in seconds) to the monitoring service.

Default: 120

sitemanager_numsites

integer

1-1024

Maximum number of site workers per site manager.

Default: 1024

store_cleanup

string

hh:mm

Time of day at which the daily storage cleanup runs, in 24-hour clock format.

Default: 04:00

xmlrpcport

integer

port number

The crawler base port. Overridden by the -p option to crawler.exe.

Example

The following example specifies options of different types:

<attrib name="dbtrace" type="boolean"> no </attrib>
<attrib name="numprocs" type="integer"> 2 </attrib>
<attrib name="disk_resume_threshold" type="real"> 629145600 </attrib>
<attrib name="browser_engines" type="list-string">
    <member> localhost:13045 </member>
</attrib>
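
As a further sketch, the two disk thresholds can be set as a pair. The values below (1 GB and 2 GB, expressed in bytes) are hypothetical. The sketch assumes that the resume threshold should stay above the suspend threshold, as it does with the documented defaults.

```xml
<!-- Hypothetical values in bytes: 1073741824 = 1 GB, 2147483648 = 2 GB -->
<attrib name="disk_suspend_threshold" type="real"> 1073741824 </attrib>
<attrib name="disk_resume_threshold" type="real"> 2147483648 </attrib>
```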

feeding options

The following options are valid inside a section element that has the name feeding. These options configure characteristics of submitting Web items to content indexing.

Option Type Value Description

feeder_threads

integer

1-8

Specifies the number of content feeder threads to start. For large-scale scenarios, increasing the number of threads can improve performance.

Note

Must only be changed when the <FASTSearchFolder>\data\crawler\store\dsqueues directory is empty.

Default: 1

fs_threshold

integer

0-2^31

Specifies the maximum size of items sent in a batch to indexing. Any item larger than this value will be sent as a URL reference, which the item processor downloads individually from the crawler.

Default: 128

max_batch_datasize

integer

0-2^31

Specifies the maximum number of bytes per batch. Reducing the maximum batch data size may reduce item processor memory usage.

Default: 52428800 (50 MB)

max_batch_size

integer

1-1024

The maximum number of items in each batch submission. Smaller batches may be sent if not enough items are available, or if the memory size of the batch grows too large.

Reducing the maximum batch size may reduce item processor memory usage, but may also decrease performance.

Default: 128

max_cb_timeout

integer

1-3600

The maximum number of seconds to wait for outstanding callbacks in content indexing during shutdown.

Default: 1800

Example

The following example specifies a typical feeding section:

<section name="feeding">
  <attrib name="feeder_threads" type="integer"> 1 </attrib>
  <attrib name="max_cb_timeout" type="integer"> 1800 </attrib>
  <attrib name="max_batch_size" type="integer"> 128 </attrib>
  <attrib name="max_batch_datasize" type="integer"> 52428800 </attrib>
  <attrib name="fs_threshold" type="integer"> 128 </attrib>
</section>
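
The option descriptions note that smaller batches may reduce item processor memory usage at some cost in performance. A hedged sketch of such a change follows. The values are illustrative (half of each documented default), and the sketch assumes that options omitted from the section keep their defaults.

```xml
<section name="feeding">
  <!-- Illustrative: half the default item count (128) per batch -->
  <attrib name="max_batch_size" type="integer"> 64 </attrib>
  <!-- Illustrative: half the default data size (52428800 bytes) per batch -->
  <attrib name="max_batch_datasize" type="integer"> 26214400 </attrib>
</section>
```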

dns options

The following options are valid inside a section element that has the name dns. They specify settings related to the crawler's internal DNS resolver. In single node installations, the node scheduler calls DNS to resolve host names. In a multiple node installation, this job is performed by the multi-node scheduler.

Option Type Value Description

db_cachesize

integer

1-2^31

DNS database cache size in bytes. A multi-node scheduler will use 4 times this amount.

Default: 10485760

ipv4

boolean

yes|no

Indicates if the crawler should resolve host names into IPv4 addresses.

Default: yes

ipv6

boolean

yes|no

Specifies if the crawler should resolve host names into IPv6 addresses.

Default: yes

max_rate

integer

1-200

Maximum number of DNS requests to issue per second.

Default: 100

max_retries

integer

1-10

Maximum number of DNS retries to issue for a failed lookup before giving up.

Default: 5

min_rate

integer

1-10

Minimum number of DNS requests to issue per second.

Default: 5

min_ttl

integer

1-2^31

Minimum lifetime of resolved names (in seconds) before the crawler tries to resolve them again.

Default: 21600

timeout

integer

1-300

DNS request time-out (in seconds) before retrying.

Default: 30

The min_rate, max_rate, max_retries and timeout settings apply only when the internal DNS resolver is used instead of the OS DNS resolver. The dns_use_platform_api option controls which resolver is used.

At least one of ipv4 and ipv6 must be set to yes.

Example

The following example specifies a typical DNS section:

<section name="dns">
  <attrib name="min_rate" type="integer"> 5 </attrib>
  <attrib name="max_rate" type="integer"> 100 </attrib>
  <attrib name="max_retries" type="integer"> 5 </attrib>
  <attrib name="timeout" type="integer"> 30 </attrib>
  <attrib name="min_ttl" type="integer"> 21600 </attrib>
  <attrib name="db_cachesize" type="integer"> 10485760 </attrib>
  <attrib name="ipv4" type="boolean"> yes </attrib>
  <attrib name="ipv6" type="boolean"> yes </attrib>
</section>
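
Because the rate, retry, and time-out options take effect only with the internal resolver, a configuration that relies on them also needs dns_use_platform_api set to no in GlobalConfig. The following sketch shows the two settings side by side. The values are illustrative and stay within the documented ranges. Note that the internal resolver does not support NetBIOS names.

```xml
<GlobalConfig>
    <!-- Use the crawler's internal resolver instead of the OS gethostbyname API -->
    <attrib name="dns_use_platform_api" type="boolean"> no </attrib>

    <section name="dns">
        <!-- Illustrative values; these options apply only to the internal resolver -->
        <attrib name="max_rate" type="integer"> 50 </attrib>
        <attrib name="max_retries" type="integer"> 3 </attrib>
        <attrib name="timeout" type="integer"> 60 </attrib>
        <!-- At least one of ipv4 and ipv6 must be yes -->
        <attrib name="ipv4" type="boolean"> yes </attrib>
        <attrib name="ipv6" type="boolean"> no </attrib>
    </section>
</GlobalConfig>
```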

near_duplicate_detection options

Near duplicate detection is enabled on a per-collection basis. It only works for languages that separate words with white space, for example Western languages. These options configure the near duplicate detection algorithm for collections that have it enabled.

Option Type Value Description

min_token_size

integer

1-(max_token_size-1)

This option specifies the minimum number of characters a token must have to be included in the lexicon (the lexicon is a list of the words that occur in an item). Tokens that contain fewer characters are excluded from the lexicon.

Default: 5

max_token_size

integer

1-100

Specifies the maximum character length for a token. Tokens that contain more characters are excluded from the lexicon (the lexicon is a list of the words that occur in an item).

Default: 35

unique_tokens

integer

1-10

Specifies the minimum number of unique tokens a lexicon must contain to perform advanced duplicate detection. (A lexicon is the list of the words that occur in an item.) Below this level, the checksum is computed on the whole item.

Default: 10

high_freq_cut

real

0.0-1.0

Specifies the percentage of tokens (as a decimal) with a high frequency to cut from the lexicon (a lexicon is the list of the words that occur in an item).

Default: 0.1

low_freq_cut

real

0.0-1.0

Specifies the percentage of tokens (as a decimal) with a low frequency to cut from the lexicon (a lexicon is the list of the words that occur in an item).

Default: 0.2

Example

The following example specifies a typical near_duplicate_detection section:

<section name="near_duplicate_detection">
  <attrib name="min_token_size" type="integer"> 5 </attrib>
  <attrib name="max_token_size" type="integer"> 35 </attrib>
  <attrib name="unique_tokens" type="integer"> 10 </attrib>
  <attrib name="high_freq_cut" type="real"> 0.1 </attrib>
  <attrib name="low_freq_cut" type="real"> 0.2 </attrib>
</section>
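
To illustrate how min_token_size and max_token_size work together, the following sketch narrows the range of token lengths that enter the lexicon. The values are hypothetical and chosen only to stay inside the documented ranges, with min_token_size below max_token_size. The sketch assumes that options omitted from the section keep their defaults.

```xml
<section name="near_duplicate_detection">
  <!-- Hypothetical values: only tokens of 4 to 20 characters enter the lexicon -->
  <attrib name="min_token_size" type="integer"> 4 </attrib>
  <attrib name="max_token_size" type="integer"> 20 </attrib>
</section>
```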

timeouts options

These options specify various global crawler time-out settings.

Option Type Value Description

compaction_idle

integer

1-3600

Specifies the time-out period (in seconds) for all ongoing crawl activity to stop, in preparation for the nightly content store defragmentation.

Site managers that are not idle at this point must be stopped before defragmentation can start.

Default: 600

compaction_kill

integer

1-3600

Specifies the time-out period (in seconds) for site managers to shut down before defragmentation. Site manager processes that are not stopped during this time will be killed.

Default: 120

shutdown_fileserver

integer

1-3600

Specifies the shut-down time-out period (in seconds) for the file server. Processes that do not shut down within the time-out period are killed.

Default: 10

shutdown_postprocess

integer

1-3600

Specifies the shut-down time-out period (in seconds) for postprocess. Processes that do not shut down within the time-out period are killed.

Default: 300

shutdown_sitemanager

integer

1-3600

Specifies the shut-down time-out period (in seconds) for the site manager. Processes that do not shut down within the time-out period are killed.

Default: 300

Example

The following example specifies a typical timeouts section:

<section name="timeouts">
  <attrib name="compaction_idle" type="integer"> 600 </attrib>
  <attrib name="compaction_kill" type="integer"> 120 </attrib>
  <attrib name="shutdown_sitemanager" type="integer"> 300 </attrib>
  <attrib name="shutdown_postprocess" type="integer"> 300 </attrib>
  <attrib name="shutdown_fileserver" type="integer"> 10 </attrib>
</section>

See Also

Reference

crawler.exe reference

Concepts

Web Crawler XML configuration reference