crawlerglobaldefaults.xml reference

Article
07/22/2014

Applies to: FAST Search Server 2010

Use crawlerglobaldefaults.xml to specify FAST Search Web crawler configuration options that apply to all crawl collections. Configuration options include DNS, content submission, duplicate detection, and other global settings. This is an advanced feature. You will rarely have to use it.

Warning

Any changes that you make to this file will be overwritten and lost if you:

Run the Set-FASTSearchConfiguration Windows PowerShell cmdlet.
Install a FAST Search Server 2010 for SharePoint update or service pack.

To avoid losing your changes, make sure that you back up this file after you have modified it.
Remember to reapply your changes after you run the Set-FASTSearchConfiguration Windows PowerShell cmdlet or install a FAST Search Server 2010 for SharePoint update or service pack.

On startup, the FAST Search Web crawler looks for crawlerglobaldefaults.xml in <FASTSearchFolder>\etc\ (where <FASTSearchFolder> is the path of the folder where you installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch). You can override this location by passing the -F <path> argument to the crawler.exe executable in NodeConf.xml. After you edit NodeConf.xml, restart nctrl.exe or run nctrl.exe reloadcfg.

If a crawlerglobaldefaults.xml file cannot be found, the FAST Search Web crawler reverts to defaults for the settings that can be specified in this file. Some settings can be overridden on the crawler.exe command line. For more information, see crawler.exe reference.

Customizing crawlerglobaldefaults.xml

Note

To modify a configuration file, verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.

To edit this file:

Edit crawlerglobaldefaults.xml in a text editor, not a general-purpose XML editor. Use the existing file in <FASTSearchFolder>\etc\ as a starting point. Include the elements and settings that you must have.
Run nctrl.exe restart crawler to restart the FAST Search Web crawler process with the options that you set in step 1.

If the FAST Search Web crawler is running as a multi-node crawler, this file must be edited on each server where a crawler is running. Each crawler must also be restarted, by running nctrl.exe restart multinodescheduler on the node running the multi-node scheduler and nctrl.exe restart nodescheduler on the servers that are running the node schedulers.

crawlerglobaldefaults.xml quick reference

This table lists the elements in crawlerglobaldefaults.xml. The elements can appear in any order, with two exceptions: GlobalConfig must contain all section and attrib elements, and member can occur only inside an attrib element.

Element	Description
CrawlerConfig	This root element identifies the file as a FAST Search Web crawler configuration file.
GlobalConfig	This element identifies the file as a global configuration settings file for the FAST Search Web crawler.
attrib	This child element specifies a configuration setting, specified either by its value or a set of member elements. Formatted as: `<attrib name="name" type="string|integer|real|boolean"> value </attrib>`
member	This child element can only occur in an attrib element. It specifies a configuration setting in a list, and is formatted as: `<attrib name="name" type="list-string"> <member> first value </member> .. <member> last value </member> </attrib>`
section	This child element contains multiple settings grouped by type.

This table lists the options in crawlerglobaldefaults.xml.

Option	Description
GlobalConfig options	These options are valid inside the GlobalConfig element.
feeding options	These options are valid inside a section element that has the name "feeding". They configure characteristics of submitting Web items to content indexing.
dns options	These attributes specify settings related to the crawler's internal DNS resolver.
near_duplicate_detection options	These options configure the near duplicate detection algorithm for collections that have it enabled.
timeouts options	These options specify global crawler time-out settings.

crawlerglobaldefaults.xml file format

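XML tags in crawlerglobaldefaults.xml begin with < and end with >. Each element is closed either by a matching end tag, such as </attrib>, or, if the element is empty, by ending its tag with />.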

The basic element format is as follows:

<attrib name="value" type="value"> value </attrib>

For example:

<attrib name="sitemanager_numsites" type="integer" > 1024 </attrib>

Element names, section names, attribute names, and attribute values are case-sensitive. The values of the name and type attributes must be enclosed in quotation marks (" "). An element definition can span multiple lines. Spaces, carriage returns, line feeds, and tab characters are ignored in an element definition.

For example:

<attrib
    name="sitemanager_numsites"
    type="integer"
> 1024 </attrib
>

Tip

For long parameter definitions, position values on separate lines and use indentation to make the file easier to read.

The <GlobalConfig> element is a special case and is required. All other elements are contained within the <GlobalConfig> element, and the element is closed with </GlobalConfig>.

The basic structure of the crawlerglobaldefaults.xml file is as follows:

<?xml version="1.0"?>
<CrawlerConfig>
    <GlobalConfig>
        ...
    </GlobalConfig>
</CrawlerConfig>

You can add comments anywhere between elements, delimited by <!-- and -->.
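
As a sketch of how these pieces fit together, the following file combines the required wrapper elements with one top-level option, one section, and comments. The values shown are the documented defaults; the choice of options is illustrative, not a recommended configuration.

```xml
<?xml version="1.0"?>
<CrawlerConfig>
    <GlobalConfig>
        <!-- Top-level option: number of site manager processes (default 2) -->
        <attrib name="numprocs" type="integer"> 2 </attrib>

        <!-- Options grouped in a named section -->
        <section name="dns">
            <attrib name="timeout" type="integer"> 30 </attrib>
        </section>
    </GlobalConfig>
</CrawlerConfig>
```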

CrawlerConfig

This is the top-level element. It has no attributes.

GlobalConfig

This element contains the global crawler configuration. It has no attributes.

attrib

This child element specifies a configuration option, either a single value or a list using the member element.

Attributes

Attribute	Value	Description
name	option name	Specifies the option to configure. See the valid options in the option sections later in this topic.
type	string|integer|real|boolean|list-string	Specifies the type for the option value: string - specifies a string type for the option value. integer - specifies an integer type for the option value. The integer range is 0-2^31 unless otherwise stated. real - specifies a real type for the option value. The real range is 0-2^63 unless otherwise stated. boolean - specifies a Boolean type for the option value. Valid Boolean values are "yes" and "no". list-string - specifies that the option value is a list of values, specified by one or more member elements.

The value of the type attribute must match the type associated with the option that is specified for the name attribute. For example, the numprocs option must always be used with the integer type.

Example

The following example specifies the value 2 for the numprocs option:

<attrib name="numprocs" type="integer"> 2 </attrib>

member

This specifies an element in a list of option values. It has no attributes.

The member element can only be used inside an attrib element.

Example

The following example specifies two browser engines for the browser_engines option:

<attrib name="browser_engines" type="list-string">
    <member> hostname1:13045 </member>
    <member> hostname2:13045 </member>
</attrib>

section

This child element groups a set of related options. A section element contains attrib elements.

Attributes

Attribute	Value	Description
name	name	Specifies the name of the section. Supported sections are listed in the options tables later in this topic.

Example

The following example configures the DNS options, specifying only the timeout option:

<section name="dns">
    <attrib name="timeout" type="integer"> 30 </attrib>
</section>

GlobalConfig options

These options are valid inside the GlobalConfig element.

Option	Type	Value	Description
browser_engines	list-string	hostname:port	List of browser engines. The crawler uses these to process Web pages that contain JavaScript. Default: Configured automatically by the installer
datadir	string	directory	The location of the crawler content store. Overridden by the -d option to crawler.exe.
dbtrace	boolean	yes|no	Enable/disable database operation tracing. For debugging only. Default: no
directio	boolean	yes|no	Enable/disable direct I/O in postprocess and duplicate server. For debugging only. Default: no
disk_resume_threshold	real	1-2^63	Threshold (in bytes) at which the crawler resumes crawling of all collections, if they have already been suspended by disk_suspend_threshold. Default: 629145600
disk_suspend_threshold	real	1-2^63	Threshold (in bytes) when the crawler suspends crawling of all collections. Default: 524288000
dns_resolver_threads	integer	1-64	Maximum number of DNS threads. Increasing this value may improve DNS resolve performance if you are crawling a large number of hostnames. Default: 5
dns_use_platform_api	boolean	yes|no	Specifies whether to use the OS gethostbyname API for resolving DNS names and NetBIOS names, or the internal resolver. The internal DNS resolver offers better performance and scalability, but does not support NetBIOS names. Default: yes
duplicate_servers	list-string	hostname:port	List of duplicate servers. Default: Configured automatically by the installer
logdir	string	directory	The location of the crawler log. Overridden by the -L option to crawler.exe
logfile_ttl	integer	1-2^31	How long (in days) to keep rotated log files before deleting them. Default: 365
numprocs	integer	1-8	Number of site manager processes to start. Default: 2
ppdup_dbformat	string	hashlog|diskhashlog|gigabase	Database format that is used by the duplicate server in a multi-node FAST Search Web crawler deployment. Default: hashlog
rc_update_freq	integer	1-3600	Specifies the update frequency of crawl statistics (in seconds) to the monitoring service. Default: 120
sitemanager_numsites	integer	1-1024	Maximum number of site workers per site manager. Default: 1024
store_cleanup	string	hh:mm	Time of the daily storage cleanup that uses 24-hour clock time. Default: 04:00
xmlrpcport	integer	port number	The crawler base port. Overridden by the -p option to crawler.exe

Example

The following example specifies options of different types:

<attrib name="dns_use_platform_api" type="boolean"> yes </attrib>
<attrib name="numprocs" type="integer"> 2 </attrib>
<attrib name="disk_resume_threshold" type="real"> 629145600 </attrib>
<attrib name="browser_engines" type="list-string">
    <member> localhost:13045 </member>
</attrib>
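
The two disk thresholds work as a pair. The sketch below raises both while keeping the resume threshold above the suspend threshold, which mirrors the relationship between the documented defaults. The byte values (1 GiB and 2 GiB), the 30-day log retention, and the cleanup time are assumptions chosen for illustration, not tested recommendations.

```xml
<!-- Illustrative values only: 1 GiB and 2 GiB expressed in bytes -->
<attrib name="disk_suspend_threshold" type="real"> 1073741824 </attrib>
<attrib name="disk_resume_threshold" type="real"> 2147483648 </attrib>
<!-- Keep rotated log files for 30 days instead of the default 365 -->
<attrib name="logfile_ttl" type="integer"> 30 </attrib>
<!-- Run the daily storage cleanup at 02:30 (24-hour clock) -->
<attrib name="store_cleanup" type="string"> 02:30 </attrib>
```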

feeding options

The following options are valid inside a section element that has the name feeding. These options configure characteristics of submitting Web items to content indexing.

Option	Type	Value	Description
feeder_threads	integer	1-8	Specifies the number of content feeder threads to start. For large-scale scenarios, increasing the number of threads can improve performance. Note: Change this value only when the <FASTSearchFolder>\data\crawler\store\dsqueues directory is empty. Default: 1
fs_threshold	integer	0-2^31	Specifies the maximum size of items sent in a batch to indexing. Any item larger than this value will be sent as a URL reference, which the item processor downloads individually from the crawler. Default: 128
max_batch_datasize	integer	0-2^31	Specifies the maximum number of bytes per batch. Reducing the maximum batch data size may reduce item processor memory usage. Default: 50MB
max_batch_size	integer	1-1024	The maximum number of items in each batch submission. Smaller batches may be sent if not enough items are available, or if the memory size of the batch grows too large. Reducing the maximum batch size may reduce item processor memory usage, but may also decrease performance. Default: 128
max_cb_timeout	integer	1-3600	The maximum number of seconds to wait for outstanding callbacks in content indexing during shutdown. Default: 1800

Example

The following example specifies a typical feeding section:

<section name="feeding">
  <attrib name="feeder_threads" type="integer"> 1 </attrib>
  <attrib name="max_cb_timeout" type="integer"> 1800 </attrib>
  <attrib name="max_batch_size" type="integer"> 128 </attrib>
  <attrib name="max_batch_datasize" type="integer"> 52428800 </attrib>
  <attrib name="fs_threshold" type="integer"> 128 </attrib>
</section>
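
The option descriptions above note that lowering max_batch_size and max_batch_datasize may reduce item processor memory usage, possibly at the cost of performance. A sketch of such a section follows. The values (half of each default) are illustrative assumptions, and the sketch assumes that options omitted from a section keep their defaults, as in the section example earlier in this topic.

```xml
<section name="feeding">
  <!-- Half the default of 128 items per batch -->
  <attrib name="max_batch_size" type="integer"> 64 </attrib>
  <!-- 25 MB in bytes (25 * 1024 * 1024), half the 50 MB default -->
  <attrib name="max_batch_datasize" type="integer"> 26214400 </attrib>
</section>
```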

dns options

These attributes specify settings related to the crawler's internal DNS resolver. In single node installations, the node scheduler calls DNS to resolve host names. In a multiple node installation, this job is performed by the multi-node scheduler.

Option	Type	Value	Description
db_cachesize	integer	1-2^31	DNS database cache size in bytes. A multi-node scheduler will use 4 times this amount. Default: 10485760
ipv4	boolean	yes|no	Specifies whether the crawler should resolve host names into IPv4 addresses. Default: yes
ipv6	boolean	yes|no	Specifies whether the crawler should resolve host names into IPv6 addresses. Default: yes
max_rate	integer	1-200	Maximum number of DNS requests to issue per second. Default: 100
max_retries	integer	1-10	Maximum number of DNS retries to issue for a failed lookup before giving up. Default: 5
min_rate	integer	1-10	Minimum number of DNS requests to issue per second. Default: 5
min_ttl	integer	1-2^31	Minimum lifetime of resolved names (in seconds), before it tries to re-resolve. Default: 21600
timeout	integer	1-300	DNS request time-out (in seconds) before retrying. Default: 30

The min_rate, max_rate, max_retries, and timeout settings apply only when the internal DNS resolver is used instead of the OS DNS resolver. The dns_use_platform_api option controls which resolver is used. You must set ipv4, ipv6, or both to yes.

Example

The following example specifies a typical DNS section:

<section name="dns">
  <attrib name="min_rate" type="integer"> 5 </attrib>
  <attrib name="max_rate" type="integer"> 100 </attrib>
  <attrib name="max_retries" type="integer"> 5 </attrib>
  <attrib name="timeout" type="integer"> 30 </attrib>
  <attrib name="min_ttl" type="integer"> 21600 </attrib>
  <attrib name="db_cachesize" type="integer"> 10485760 </attrib>
  <attrib name="ipv4" type="boolean"> yes </attrib>
  <attrib name="ipv6" type="boolean"> yes </attrib>
</section>
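
Because the rate, retry, and time-out settings take effect only with the internal resolver, a configuration that tunes them would also set dns_use_platform_api to no at the GlobalConfig level. The following sketch assumes IPv4-only resolution is wanted and that no NetBIOS names need to be resolved (the internal resolver does not support them). The numeric values are illustrative assumptions within the documented ranges.

```xml
<!-- GlobalConfig level: use the crawler's internal DNS resolver -->
<attrib name="dns_use_platform_api" type="boolean"> no </attrib>

<section name="dns">
  <attrib name="max_rate" type="integer"> 150 </attrib>
  <attrib name="max_retries" type="integer"> 3 </attrib>
  <attrib name="timeout" type="integer"> 20 </attrib>
  <!-- At least one of ipv4 and ipv6 must be yes -->
  <attrib name="ipv4" type="boolean"> yes </attrib>
  <attrib name="ipv6" type="boolean"> no </attrib>
</section>
```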

near_duplicate_detection options

Near duplicate detection is enabled on a per-collection basis. It works only for languages that use white space as a word separator, such as Western languages. These options configure the near duplicate detection algorithm for collections that have it enabled.

Option	Type	Value	Description
min_token_size	integer	1-(max_token_size-1)	This option specifies the minimum number of characters a token must have to be included in the lexicon (the lexicon is a list of the words that occur in an item). Tokens that contain fewer characters are excluded from the lexicon. Default: 5
max_token_size	integer	1-100	Specifies the maximum character length for a token. Tokens that contain more characters are excluded from the lexicon (the lexicon is a list of the words that occur in an item). Default: 35
unique_tokens	integer	1-10	Specifies the minimum number of unique tokens a lexicon must contain to perform advanced duplicate detection. (A lexicon is the list of the words that occur in an item.) Below this level, the checksum is computed on the whole item. Default: 10
high_freq_cut	real	0.0-1.0	Specifies the percentage of tokens (as a decimal) with a high frequency to cut from the lexicon (a lexicon is the list of the words that occur in an item). Default: 0.1
low_freq_cut	real	0.0-1.0	Specifies the percentage of tokens (as a decimal) with a low frequency to cut from the lexicon (a lexicon is the list of the words that occur in an item). Default: 0.2

Example

The following example specifies a typical near_duplicate_detection section:

<section name="near_duplicate_detection">
  <attrib name="min_token_size" type="integer"> 5 </attrib>
  <attrib name="max_token_size" type="integer"> 35 </attrib>
  <attrib name="unique_tokens" type="integer"> 10 </attrib>
  <attrib name="high_freq_cut" type="real"> 0.1 </attrib>
  <attrib name="low_freq_cut" type="real"> 0.2 </attrib>
</section>
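
As a further sketch, the section below narrows the token length window and cuts slightly more tokens at both ends of the frequency distribution. All values are illustrative assumptions within the documented ranges. Note that min_token_size must stay below max_token_size.

```xml
<section name="near_duplicate_detection">
  <!-- min_token_size must be smaller than max_token_size -->
  <attrib name="min_token_size" type="integer"> 4 </attrib>
  <attrib name="max_token_size" type="integer"> 30 </attrib>
  <!-- Cut fractions are decimals between 0.0 and 1.0 -->
  <attrib name="high_freq_cut" type="real"> 0.15 </attrib>
  <attrib name="low_freq_cut" type="real"> 0.25 </attrib>
</section>
```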

timeouts options

These options specify various global crawler time-out settings.

Option	Type	Value	Description
compaction_idle	integer	1-3600	Specifies the time-out period (in seconds) for all ongoing crawl activity to stop, in preparation for the nightly content store defragmentation. Site managers that are not idle at this point must be stopped before defragmentation can start. Default: 600
compaction_kill	integer	1-3600	Specifies the time-out period (in seconds) for site managers to shut down before defragmentation. Site manager processes that are not stopped during this time will be killed. Default: 120
shutdown_fileserver	integer	1-3600	Specifies the shut-down time-out period (in seconds) for the file server. Processes that do not shut down within the time-out period are killed. Default: 10
shutdown_postprocess	integer	1-3600	Specifies the shut-down time-out period (in seconds) for postprocess. Processes that do not shut down within the time-out period are killed. Default: 300
shutdown_sitemanager	integer	1-3600	Specifies the shut-down time-out period (in seconds) for the site manager. Processes that do not shut down within the time-out period are killed. Default: 300

Example

The following example specifies a typical time-out section:

<section name="timeouts">
  <attrib name="compaction_idle" type="integer"> 600 </attrib>
  <attrib name="compaction_kill" type="integer"> 120 </attrib>
  <attrib name="shutdown_sitemanager" type="integer"> 300 </attrib>
  <attrib name="shutdown_postprocess" type="integer"> 300 </attrib>
  <attrib name="shutdown_fileserver" type="integer"> 10 </attrib>
</section>
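
If postprocess or the site managers need longer to shut down cleanly, a shorter section that lists only those time-outs may be enough. This sketch assumes that omitted options keep their defaults, as in the section example earlier in this topic. The 600-second values are illustrative and fall within the documented 1-3600 range.

```xml
<section name="timeouts">
  <!-- Allow 10 minutes instead of the default 300 seconds -->
  <attrib name="shutdown_postprocess" type="integer"> 600 </attrib>
  <attrib name="shutdown_sitemanager" type="integer"> 600 </attrib>
</section>
```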
