
Web Crawler XML configuration reference


Applies to: FAST Search Server 2010

Topic Last Modified: 2011-08-05

The FAST Search Web crawler automatically retrieves information from Web sites, and passes this information to the Microsoft FAST Search Server 2010 for SharePoint index. The FAST Search Web crawler is configured by creating an XML configuration file formatted as specified in this article, and submitting it to the Web crawler using the crawleradmin.exe command-line tool.

The format specified in this document is also used by the crawlercollectiondefaults.xml file, which contains all the default options/values for new crawl collections. When you modify it, you change the defaults for all new collections. The default values are used for any option not specified in the XML configuration created for a specific crawl collection.

These configuration files must be formatted in compliance with the XML schema. This document includes a simple and a typical example of a configuration file. For an overview of the elements and sections in the configuration file, refer to the table in crawlercollectiondefaults.xml quick reference.

Web site refers not to a SharePoint site, but to the content on a Web site such as www.contoso.com.

Host name refers to either "contoso" in http://contoso/ or "download.contoso.com" in http://download.contoso.com/. It can be either fully qualified or not. In this document, the difference between a Web site and a host name is that a Web site describes the actual site and its content, whereas the host name is the network name that is used to reach a given Web server. A single site might have multiple host names.

Note:
To modify a configuration file, verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.

Follow these steps to create a new crawl configuration using this XML configuration format:

  1. Copy one of the three supplied crawl configuration templates found in <FASTSearchFolder>\etc (where <FASTSearchFolder> is the path of the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch) to a new file such as MyCollection.xml, or create a new file. Edit the file in a text editor to include the elements and settings that you must have.

    Note:
    Use a text editor (e.g., Notepad) to change crawlercollectiondefaults.xml. Do not use a general-purpose XML editor.
  2. Run crawleradmin.exe -f MyCollection.xml to add the crawl configuration to the crawler. Replace MyCollection.xml with the name that you gave the file in step 1.

See crawleradmin.exe reference for more information.

Warning:
Any changes that you make to this file will be overwritten and lost if you:
  • Run the Set-FASTSearchConfiguration Windows PowerShell cmdlet.

  • Install a FAST Search Server 2010 for SharePoint update or service pack.

To avoid losing your changes, make sure that you back up this file after you have modified it.
Remember to reapply your changes after you run the Set-FASTSearchConfiguration Windows PowerShell cmdlet or install a FAST Search Server 2010 for SharePoint update or service pack.

To edit this file:

  1. Edit crawlercollectiondefaults.xml in a text editor to include the elements and settings that you must have. Use the existing file in <FASTSearchFolder>\etc\ as a starting point.

    Note:
    Use a text editor (for example Notepad) to change crawlercollectiondefaults.xml. Do not use a general-purpose XML editor.
  2. Run nctrl.exe restart crawler to restart the FAST Search Web crawler process with the options that you set in step 1.

This table lists the elements in the Web Crawler XML configuration format. The elements can appear in any order with the following exceptions. CrawlerConfig holds the DomainSpecification element. The primary elements of SubDomain, Login, and Node occur inside the DomainSpecification element. The section and attrib sub-elements can occur in any of the primary elements, in any order. The member sub-elements must appear inside an attrib element only.

<CrawlerConfig>
      <DomainSpecification>
             <SubDomain/>
             <Login/>
             <Node/>
             <attrib>
                    <member/> 
             </attrib>
             <section/>
      </DomainSpecification>
</CrawlerConfig>

Typically, you will include both attrib and section sub-elements in SubDomain, Login, and section elements. The Node element may contain all these elements and sub-elements.

 

Element Description

CrawlerConfig

This top-level element specifies that the XML following it is a Web crawler configuration object.

DomainSpecification

This element specifies a crawl collection.

SubDomain

This element specifies the configuration of crawl sub collections.

Login

This element is used for HTML forms-based authentication.

Node

This element overrides configuration parameters in a crawl collection or a crawl sub collection for a particular node scheduler.

attrib

This sub-element specifies a configuration setting, either by its value or by a set of member elements.

member

This sub-element specifies a configuration setting in a list.

section

This sub-element specifies a section that contains multiple settings grouped by type. A table listing all possible sections follows.

This table defines the section options in the Web Crawler XML configuration format. Sections cannot occur inside the CrawlerConfig element.

 

Section name Description

include_domains

Defines a set of host name filters that specify which URIs to include in a crawl collection

exclude_domains

Defines a set of host name filters that specify which URIs to exclude from a crawl collection

include_uris

Defines a set of URI rules that specify which URIs to include in a crawl collection

exclude_uris

Defines a set of URI rules that specify which URIs to exclude from a crawl collection

log

Specifies logging behavior for the Web crawler process

storage

Specifies how the Web crawler stores content and metadata

pp

Specifies the post processing behavior for a node scheduler

ppdup

Specifies duplicate server settings

feeding

Consists of at least one section element that specifies how to send a representation of the crawl collection to the indexing engine

cachesize

Configures the cache sizes for the Web crawler process

http_errors

Specifies how to handle HTTP/HTTPS error response codes and conditions

ftp_errors

Specifies how to handle response codes and error conditions for FTP URIs

workqueue_priority

Specifies the priority levels for the crawl queues, and specifies the rules and modes used to insert URIs into and extract URIs from the queues

link_extraction

Specifies which kind of hyperlinks to follow

limits

Specifies fail-safe limits for a crawl collection

focused

Configures focused scheduling

passwd

Configures credentials for Web sites that require authentication

ftp_acct

Specifies FTP accounts for crawling FTP URIs

exclude_headers

Specifies items to exclude from the crawl, based on the contents of the HTTP header fields

variable_delay

Specifies time slots that use a different delay request rate

adaptive

Specifies the adaptive crawling options

weights

Each URI is given a score in the adaptive crawling process. The weights section must occur inside an adaptive section.

sitemap_weights

<URL> entries in a sitemap can contain a changefreq element, which specifies how frequently a URI can be modified. The string values are converted into a numeric weight for adaptive crawling. The sitemap_weights section must occur in an adaptive section.

site_clusters

Specifies configuration parameters that override the crawler’s usual behavior of routing host names in a node scheduler

crawlmode

Limits the span of a crawl collection

post_payload

Submits content to HTTP POST requests

rss

Initializes and configures RSS feed support in a crawl collection

logins

This is a special case of a Login element; multiple Login elements are merged into a logins section. Either a logins section or one or more Login elements are required when you define HTML forms-based authentication. You must use logins to remove a login because of the way partial configurations work. Note that exporting a configuration from the crawler with crawleradmin returns the Login element.

parameters

Sets the authentication credentials that are used in an HTML form. Must occur in a Login element or a logins section.

subdomains

Specifies the configuration of crawl sub collections. This is a special case of a SubDomain element; multiple SubDomain elements are merged into a subdomains section. You must use subdomains to remove a subdomain because of the way partial configurations work. Note that exporting a configuration from the crawler with crawleradmin returns the SubDomain element.

XML elements in the configuration file are enclosed in angle brackets: an element either has an opening tag and a matching closing tag, or is written as a self-closing tag ending with />.

The basic element format is as follows:

<attrib name="value" type="value"> value </attrib>

For example:

<attrib name="accept_compression" type="boolean"> yes </attrib>

Elements, section names, attributes, and attribute values are case-sensitive. Attribute names and types must be enclosed in quotation marks (" "). An element definition can span multiple lines. Spaces, carriage returns, line feeds, and tab characters are ignored in an element definition.

For example:

<attrib
    name=" accept_compression "
    type="boolean"
> yes </attrib
>

Tip:
For long parameter definitions, position values on separate lines and use indentation to make the file easier to read.

The <CrawlerConfig> element is a special case and is required. All other elements are contained within the <CrawlerConfig> element, and the element is closed with </CrawlerConfig>.

The basic structure of the XML file is in the following example:

<?xml version="1.0"?>
<CrawlerConfig>
    <DomainSpecification>
        ...
    </DomainSpecification>
</CrawlerConfig>

You can add comments anywhere, delimited by <!-- and -->.

This top-level element specifies that the XML following it is a Web crawler configuration object. A Web crawler configuration file can contain only one CrawlerConfig XML element.

This element specifies a crawl collection.

<CrawlerConfig>
  <DomainSpecification name="sp">
  ...
  </DomainSpecification>
</CrawlerConfig>

Replace "sp" with the crawl collection name.

This element specifies a configuration option, either a single value or a list using the member element.

 

Name Type Value Meaning

info

string

A text description of the crawl collection.

fetch_timeout

integer

<seconds>

Specifies the maximum downloading time, in seconds, for a Web item. Increase this value if you expect to download large Web items from slow Web servers.

Default: 300

allowed_types

list-string

 

Specifies valid Web item MIME types.

The Web crawler process discards other MIME types. This configuration parameter supports wildcard expansion of a whole field. Wildcards are represented by an asterisk character. For example: "text/*" or "*/*" but not "*/html" or "application/ms*".

Default:

  • text/html

  • text/plain

  • application/msword

  • application/msexcel

  • application/ppt

  • application/pdf
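For illustration, the default list above could be restated explicitly, or widened with a whole-field wildcard. The MIME types in this sketch are examples only; tailor them to your content:

```xml
<attrib name="allowed_types" type="list-string">
    <member> text/html </member>
    <member> text/plain </member>
    <!-- whole-field wildcard: accepts any application/* type -->
    <member> application/* </member>
</attrib>
```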

force_mimetype_detection

boolean

yes|no

Specifies that the Web crawler process uses its own MIME type detection on items. In most cases, Web servers return the MIME type of Web items when they are downloaded, as part of the HTTP header. If this option is enabled, Web items will get tagged with the MIME type that looks most accurate: either the one received from the Web server or the result of the crawler’s detection.

Default: no

allowed_schemes

list-string

HTTP

HTTPS

FTP

Specifies the URI schemes that the Web crawler should process.

Default: HTTP

ftp_passive

boolean

yes|no

Specifies that the Web crawler uses passive FTP mode.

Default: yes

domain_clustering

boolean

yes|no

Specifies whether to route host names from the same domain to the same site manager process. Useful when you are dealing with host names that must share information such as cookies, because this information is not exchanged between site manager processes. If enabled in a multiple node configuration, host names on the same domain (for example, www.contoso.com and forums.contoso.com) will also be routed to the same node scheduler.

Default for single node: no

Default for multiple node: yes

max_inter_docs

integer

positive integer, or no value

Specifies the maximum number of items to crawl before interleaving Web sites. By default, the crawler will crawl a Web site to exhaustion, or until the maximum number of Web items per Web site is reached. However, the crawler can be configured to crawl "batches" of Web items from Web sites at a time, interleaving between Web sites. This attribute specifies how many Web items to consecutively crawl from a server before the crawler interleaves and starts crawling other servers. The crawler will return to crawling the former server when resources are freed up.

Default: empty (disabled)

max_redirects

integer

<value>

Specifies the maximum number of HTTP redirects to follow from a URI.

Default: 10

diffcheck

boolean

yes|no

Specifies that the Web crawler performs duplicate detection. Duplicate detection is performed by checking whether two or more Web items have the same content.

Default: yes

near_duplicate_detection

boolean

yes|no

Specifies that the Web crawler must use a less strict duplicate detection algorithm. In this case duplicate items are detected by identifying a unique pattern of words.

Default: no

max_uri_recursion

integer

<value>

Use this attribute to check for repeating patterns in URIs. The option specifies the maximum number of times a pattern can be repeated before the resulting URI is discarded. A value of 0 disables the test.

For example, http://www.contoso.com/widget linking to http://www.contoso.com/widget/widget is a repetition of 1 element.

Default: 5

ftp_searchlinks

boolean

yes|no

Specifies that the Web crawler should search for hyperlinks in items downloaded from FTP servers.

Default: yes

use_javascript

boolean

yes|no

Specifies if JavaScript support should be enabled in the Web crawler. If enabled, the Web crawler will download, parse/execute, and extract links from any external JavaScript.

Note:
JavaScript processing is resource intensive and should not be enabled for large crawls.
Note:
Processing JavaScript uses the Browser Engine component. For more information, see beconfig.xml reference.

Default: no

javascript_keep_html

boolean

yes|no

Specifies what to submit to the indexing engine. If this parameter is set to yes, the HTML that results from the JavaScript processing is used. Otherwise, the original HTML item is used.

Do not use this option if the use_javascript configuration parameter is not set to yes.

javascript_delay

real

<seconds>

An empty value means that the Web crawler uses the same value as the delay configuration parameter.

Specifies the delay, in seconds, to use when retrieving dependencies associated with an HTML item that contains JavaScript.

Default: 0 (no delay)

exclude_exts

list-string

<comma delimited list of file_extensions>

Specifies file name extensions that should be excluded from the crawl.

Default list: empty
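A sketch of excluding common image file types, assuming the usual list-string member syntax; the extensions listed are examples only:

```xml
<attrib name="exclude_exts" type="list-string">
    <member> jpg </member>
    <member> gif </member>
    <member> png </member>
</attrib>
```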

use_http_1_1

boolean

yes|no

Specifies that the Web crawler should use HTTP/1.1. When set to no, HTTP/1.0 is used.

Default: yes

accept_compression

boolean

yes|no

Specifies that the Web crawler should accept compressed Web items from the Web server. This parameter has no effect if the use_http_1_1 configuration parameter is not enabled.

Default: yes

dbswitch

integer

<value>

Specifies the number of crawl cycles that a Web item can remain in the crawl store and index without being found by the Web crawler, before it is deleted. The dbswitch_delete parameter determines the action that should be taken for Web items that are not seen for this number of crawl cycles.

Note:
Setting this value very low (1 or 2) may cause Web items to be deleted accidentally.

Default: 5

dbswitch_delete

boolean

yes|no

The Web crawler tries to detect Web items that were removed from the Web servers. This parameter determines what to do with those Web items. They can be deleted immediately or put in the work queue for retrieval to make sure that they are no longer available.

When set to yes, Web items that are too old are deleted. When set to no, Web items are scheduled for re-retrieval and only deleted if they no longer exist on the Web server.

This check is performed independently for each Web site, at the start of each refresh cycle.

Note:
You should keep this option at the default value.

Default: no

html_redir_is_redir

boolean

yes|no

Use this parameter with html_redir_thresh to treat META Refresh tags inside HTML Web items as if they were HTTP redirects. When enabled, the Web item that contains the META refresh will not be indexed. When disabled, they are treated as regular Web items and are indexed.

Default: yes

html_redir_thresh

integer

<value>

Specifies the maximum number of seconds for which a META Refresh tag inside an HTML Web item is treated as an HTTP redirect. This parameter is ignored if html_redir_is_redir is not set.

Consider the following example:

<META HTTP-EQUIV="Refresh" CONTENT="3;URL=http://www.some.org/some.html">

If the number that is specified in the CONTENT attribute (in this example, 3) is less than or equal to the value of html_redir_thresh, the META Refresh tag is treated as a redirect.

Default: 3

robots_ttl

integer

<seconds>

Specifies how frequently the Web crawler should retrieve the robots.txt file from a Web site. The frequency must be specified in seconds.

Default: 86400

use_sitemaps

boolean

yes|no

Enables the Web crawler to discover and parse sitemaps.

The Web crawler uses the lastmod attribute in a sitemap to determine whether a Web item was modified since the last time that the sitemap was retrieved. Web items that were not modified will not be re-crawled.

An exception is if the collection uses adaptive refresh mode. In adaptive refresh mode, the crawler uses a sitemap’s priority and changefreq attributes to determine how often a Web item should be crawled. Other tags found in sitemaps are stored in the crawler’s meta database and are submitted to indexing as crawled properties.

Note:
Most sitemaps are specified in robots.txt. Thus, the robots attribute should be enabled for the best results.

Default: no
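A minimal fragment that turns on sitemap support, together with the robots attribute as recommended in the note above, might look like this sketch:

```xml
<!-- Enable sitemap discovery; robots must be on so that sitemaps
     referenced from robots.txt are found -->
<attrib name="use_sitemaps" type="boolean"> yes </attrib>
<attrib name="robots" type="boolean"> yes </attrib>
```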

max_pending

integer

<value>

Specifies the maximum number of concurrent HTTP requests to a single Web site at any time.

Default: 2

robots_auth_ignore

boolean

yes|no

Specifies whether the Web crawler should ignore robots.txt if an HTTP 40x authentication error is returned by the Web server. When set to no, the Web crawler will not crawl the Web site upon encountering the error.

The robots.txt standard lists this behavior as a hint for Web crawlers to ignore the Web site completely. However, incorrect configuration of a Web server can incorrectly exclude a site from the crawl. Enable this option to make sure that the Web site is crawled.

Default: yes

robots_tout_ignore

boolean

yes|no

Specifies whether the Web crawler should ignore the robots.txt rules if the request for robots.txt times out.

Before crawling a Web site, the Web crawler requests the robots.txt file from the Web server. By the robots.txt standard, if the request for this file times out, the Web site will not be crawled. Setting this parameter to yes ignores the robots.txt rules in this case, and the Web site is crawled.

Note:
You should keep this option set to no if you do not own the Web site being crawled.

Default: no

rewrite_rules

list-string

Specifies a set of rules that are used to rewrite URIs.

A rewrite rule has two components: an expression to match (match_pattern), and a replacement string (replacement_string) that will replace the first expression. The expression to match is a grouped match regular expression.

The format of the rewrite rule is as follows: @match_pattern@replacement_string@, where @ is any non-white-space separator character that is not included in the expression itself.
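As a hypothetical example, the following rule rewrites URIs on a www alias to the bare host name, using @ as the separator character. The host names and pattern are illustrative only; verify any rule against your own URIs:

```xml
<attrib name="rewrite_rules" type="list-string">
    <!-- @match_pattern@replacement_string@ -->
    <member> @^http://www\.contoso\.com/@http://contoso.com/@ </member>
</attrib>
```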

extract_links_from_dupes

boolean

yes|no

Specifies that the Web crawler should extract hyperlinks from duplicate Web items. Even when two Web items have duplicate content, they may have different hyperlinks, which could lead to more content being found by the Web crawler.

Default: no

use_meta_csum

boolean

yes|no

Specifies that the Web crawler includes META tags in the generated duplicate detection fingerprint.

Default: no

csum_cut_off

integer

<value>

Specifies the maximum number of bytes to use to generate the duplicate detection fingerprint. If this parameter is set to 0, the feature is disabled (i.e., unlimited/all bytes will be used).

Default: 0

if_modified_since

boolean

yes|no

Specifies whether the Web crawler should send the If-Modified-Since HTTP request header.

Default: yes

use_cookies

boolean

yes|no

Specifies whether the Web crawler should send and store cookies. This feature is automatically enabled for Web sites that use a login, but can also be turned on for all Web sites.

Default: no

uri_search_mime

list-string

<values>

Specifies the MIME types from which the Web crawler extracts hyperlinks.

This configuration parameter supports wildcard expansion only at the whole field level. A wildcard is represented by the asterisk character; for example, text/* or */* but not */html or application/ms*.

Default:

  • text/html

  • text/vnd.wap.xml

  • text/wml

  • text/x-wap.wml

  • x-application/wml

  • text/x-hdml

max_backoff_counter

integer

<value>

Together with max_backoff_delay, this option controls the algorithm by which a Web site experiencing connection failures is contacted less frequently.

For each consecutive network error, the request delay for that Web site is increased by the original delay setting, up to a maximum of max_backoff_delay seconds. This delay is maintained until a request is successfully completed, but for no more than max_backoff_counter requests. If the maximum count is reached, crawling of the Web site is temporarily stopped.

Otherwise, when network issues affecting the Web site are resolved, the internal backoff counter starts decreasing, and the request delay is decreased by half on each successful Web item download until the original delay setting is reached.

Default: 50

max_backoff_delay

integer

<seconds>

See max_backoff_counter.

Default: 600

delay

real

<seconds>

Specifies the minimum delay, in seconds, between retrievals of Web items from the same Web site.

Default: 60.0

refresh

real

<minutes>

Specifies how frequently (in minutes) the Web crawler should start a new crawl refresh cycle.

The action that is performed at the time of refresh is determined by the refresh_mode setting.

Default: 1500.0

robots

boolean

yes|no

Specifies that the Web crawler should obey the rules found in robots.txt files.

Default: yes

start_uris

list-string

Specifies start URIs for the Web crawler. The Web crawler needs either start_uris or start_uri_files to start crawling.

Note:
If the crawl includes any IDNA host names, enter them using UTF-8 characters, not in the DNS encoded format.
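For example, a small crawl might seed the queue as follows; the URIs shown are placeholders:

```xml
<attrib name="start_uris" type="list-string">
    <member> http://www.contoso.com/ </member>
    <member> http://intranet.contoso.com/news/ </member>
</attrib>
```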

start_uri_files

list-string

Specifies a list of files that contain start URIs. These files are stored in plain text file format, with one start URI per line.

Note:
In a multiple node deployment, these files only need to be available on the server that runs the multi-node scheduler.

max_sites

integer

<value>

Specifies the maximum number of Web sites that can be crawled at the same time. In a multi-node Web crawler deployment, this value applies per node scheduler, not to the whole Web crawler.

For example, if max_sites is set to 5 and you have 10 sites to crawl, 5 sites must finish crawling before the crawler can crawl the other 5.

Note:
A high max_sites value can adversely affect system resource usage.

Default: 128

mirror_site_files

list-string

Specifies a list of files that contain mirror sites for a specified host name. A mirror site is a replica of an already existing Web site. This file uses the following format: a plain text file that has a space-separated list of host names with the preferred name listed first.

Note:
In a multiple node Web crawler deployment, this file must be available on all servers where a node scheduler is deployed.

proxy

list-string

Specifies a set of HTTP proxies that the Web crawler uses to fetch Web items.

Each proxy is specified by using the following format:

(http://)((domain!)username:password@)hostname(:port)

Optional parts are enclosed in parentheses.

The password can be encrypted as specified in passwd.
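A sketch of a single authenticating proxy entry; the host name, port, and credentials here are hypothetical:

```xml
<attrib name="proxy" type="list-string">
    <!-- format: (http://)((domain!)username:password@)hostname(:port) -->
    <member> http://proxyuser:pass@proxy.contoso.com:8080 </member>
</attrib>
```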

proxy_max_pending

integer

<value>

Specifies a limit on the number of outstanding open connections per HTTP proxy.

Default: maximum value of INT32

headers

list-string

<header>

Specifies additional HTTP headers to add to the request sent to the Web servers.

The current default is as follows: User-Agent: FAST Search Web Crawler <version>
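For instance, additional headers could identify the crawl operator and a preferred content language; the header values shown are examples only:

```xml
<attrib name="headers" type="list-string">
    <member> From: crawleradmin@contoso.com </member>
    <member> Accept-Language: en-us </member>
</attrib>
```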

cut_off

integer

Specifies the maximum number of bytes in an item. A Web item larger than this size limit is discarded or truncated depending on the value of the truncate configuration parameter.

If no cut_off configuration parameter is specified, this option is disabled.

Default: no cut-off

truncate

boolean

yes|no

Specifies whether a Web item should be truncated when it exceeds the specified cut_off threshold.

Default: yes

check_meta_robots

boolean

yes|no

Specifies that the Web crawler should follow the <NoIndex /> and <NoFollow /> directives given by the robots META tag.

For example, a typical META tag might be:

<meta name="robots" content="nofollow,noindex"/>

or

<meta http-equiv="robots" content="nofollow,noindex"/>

The special value none means both nofollow and noindex.

Default: yes

obey_robots_delay

boolean

yes|no

Specifies that the Web crawler should follow the crawl-delay directive (if present) in robots.txt files. Otherwise, the delay setting is used.

Default: no

key_file

string

Specifies the path of an SSL client certificate key file that is used for HTTPS connections.

This feature is used for Web sites that require the Web crawler to authenticate itself using a client certificate.

This option must be used with cert_file.

Note:
In a multi-node Web crawler deployment, the file must be on all node schedulers.

cert_file

string

Specifies the path of an X509 client certificate file that is used for HTTPS connections.

This option must be used with key_file.

max_doc

integer

<value>

Specifies the maximum number of Web items to download from a Web site.

Default: 100000

enforce_delay_per_ip

boolean

yes|no

Specifies that the Web crawler limits requests to Web servers whose names map to a shared IPv4 or IPv6 address. This parameter depends on the delay configuration parameter.

Default: yes

wqfilter

boolean

yes|no

Specifies whether the Web crawler should use a bloom filter that removes duplicate URIs from the crawl queues.

Default: yes

smfilter

integer

<value>

Specifies the maximum number of bits in the bloom filter that removes duplicate URIs from the queue associated with the node scheduler.

A bloom filter is a space-efficient probabilistic data structure (a bit array) which is used to test whether an element is a member of a given set. The test may yield a false positive but never a false negative.

Default: 0

mufilter

integer

<value>

Specifies the maximum number of bits in the bloom filter that removes duplicate URIs sent from a node scheduler to the multi-node scheduler.

We recommend that you turn on this filter for large crawls, with a value of 500000000 (500 million bits).

Default: 0
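Following the recommendation above, a large multi-node crawl might set:

```xml
<attrib name="mufilter" type="integer"> 500000000 </attrib>
```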

umlogs

boolean

yes|no

Specifies whether all logging is sent to the multi-node scheduler for storage.

If this parameter is not enabled, logs reside only on the node schedulers.

Default: yes

sort_query_params

boolean

yes|no

Specifies whether the Web crawler should sort the parameters in the query component of a URI.

Typically, query components are key-value pairs that are separated by semicolons or ampersands. When this configuration parameter is set, the query is sorted alphabetically by the key name.

Default: no

robots_timeout

integer

<seconds>

Specifies the maximum number of seconds that the Web crawler can use to download a robots.txt file.

Default: 300

login_timeout

integer

<seconds>

Specifies the maximum number of seconds that the Web crawler can use for a login request.

Default: 300

send_links_to

string

Specifies a crawl collection name to which all extracted hyperlinks are sent.

cookie_timeout

integer

<seconds>

Specifies the maximum number of seconds a session cookie is stored. A session cookie is a cookie that has no expiration date.

Default: 300

refresh_when_idle

boolean

yes|no

Specifies whether the Web crawler should trigger a new crawl refresh cycle when it becomes idle. This option should not be used in a multi-node installation.

Default: no

refresh_mode

string

append|prepend|scratch|soft|adaptive

Specifies the refresh mode of a crawl collection. Valid values are as follows:

  • append: Add the start URIs to the end of the crawl queue when a crawl refresh cycle begins. Existing queues are retained.

  • prepend: Add the start URIs to the beginning of the crawl queue when a crawl refresh cycle begins. Existing queues are retained.

  • scratch: Erase the crawl queue before appending the start URIs to the queue.

  • soft: If the crawl queue for a Web site is not empty at the end of a crawl refresh cycle, the Web crawler continues crawling into the next crawl refresh cycle. A crawl site is not refreshed until the crawl queue is empty.

  • adaptive: Build crawl queue according to the adaptive configuration.

Default: scratch

<attrib name="delay" type="real"> 60.0 </attrib>
<attrib name="max_doc" type="integer"> 10000 </attrib>
<attrib name="use_javascript" type="boolean"> no </attrib>
<attrib name="info" type="string">
My Web crawl collection crawling my intranet.
</attrib>
<attrib name="allowed_schemes" type="list-string">
    <member> http </member>
    <member> https </member>
</attrib>

This specifies an element in a list of option values.

The member element can only be used inside an attrib element.

<attrib name="allowed_schemes" type="list-string">
    <member> http </member>
    <member> https </member>
</attrib>

This element groups a set of related options. A section element contains attrib elements.

 

Attribute Value Description

name

<name>

Specifies the name of the section. Supported sections are described in this article.

<section name="crawlmode">
    <attrib name="fwdlinks" type="boolean"> no </attrib>
    <attrib name="fwdredirects" type="boolean"> no </attrib>
    <attrib name="mode" type="string"> FULL </attrib>
    <attrib name="reset_level" type="boolean"> no </attrib>
</section>

This section is a set of host name filters that specify which URIs to include in a crawl collection. An empty section matches any host name.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

exact

list-string

Specifies a list of host names. If the host name of a URI exactly matches one of these host names, the URI is included by this rule.

prefix

list-string

Specifies a list of host names. If the host name of a URI begins with one of these host names, the URI is included by this rule.

suffix

list-string

Specifies a list of host names. If the host name of a URI ends with one of these host names, the URI is included by this rule.

regexp

list-string

Specifies a list of regular expressions. If the host name of a URI matches one of these regular expressions, the URI is included by this rule.

ipmask

list-string

Specifies a list of IPv4 address masks. If the IPv4 address of a retrieved URI matches one of these IPv4 address masks, the URI is included by this rule. An IPv4 address mask must follow one of the following formats:

  • A range of IPv4 addresses can be specified by writing an IPv4 address in string format and using a hyphen for the range. For example: 207.46.197.0-100 or 207.46.190-197.100

    If an IPv4 address is within this range, it is included by this mask.

  • An IPv4 mask can also be specified by examining the N most significant bits of an IPv4 address, where N is within the range of {0, 32}.

    The mask is an IPv4 address in string format followed by a forward slash and the number of most significant bits. For example: 207.46.197.0/24

    If an IPv4 address matches the specified IPv4 address in its N most significant bits, it is included by this mask.

  • An IPv4 mask can also be specified by using a bitmask to mask out important bits of an IPv4 address.

    The format of this mask is an IPv4 address in string format, followed by a colon and an ip-mask, where ip-mask is either an IPv4 address in string format that is used for masking or a 32-bit hexadecimal value. For example: 207.46.197.0:255.255.255.0 or 207.46.197.0:0xffffff00

    If an IPv4 address has the same bits set as specified by the ip-mask and the IPv4 address, it is included by this mask.

ip6mask

list-string

Specifies a list of IPv6 address masks. If the IPv6 address of a retrieved URI matches one of these IPv6 address masks, the URI is included by this rule.

An IPv6 address mask must follow one of the following formats:

  • A range of IPv6 addresses can be specified by writing an IPv6 address in string format and using a hyphen for the range. For example: 2002:CF2E:C500-C564:0:0:0:0:0 or ::ffff:207.46.197.0-100

    If an IPv6 address is within this range, it is included by this mask.

  • An IPv6 mask can also be specified by looking at the N most significant bits of an IPv6 address, where N has the range of {0, 128}.

    This mask is an IPv6 address in string format followed by a forward slash and the number of most significant bits. For example: 2002:CF2E:C500:0:0:0:0:0/60

    If an IPv6 address matches the specified IPv6 address in its N most significant bits, it is included by this mask.

<section name="include_domains">
   <attrib name="exact" type="list-string">
      <member> www.contoso.com </member>
      <member> www2.contoso.com </member>
   </attrib>
   <attrib name="prefix" type="list-string">
      <member> www </member>
   </attrib>
   <attrib name="suffix" type="list-string">
      <member> .contoso.com</member>
      <member> .contoso2.com</member>
   </attrib>
   <attrib name="regexp" type="list-string">
      <member> .*\.contoso\.com </member>
   </attrib>
   <attrib name="file" type="list-string">
       <member> c:\myinclude_domains.txt </member>
   </attrib>
</section>
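The example above does not show the ipmask and ip6mask filters described in the table. A minimal sketch of how they might be added to the same section, reusing the address formats from the table (the addresses themselves are illustrative):

```xml
<section name="include_domains">
   <attrib name="ipmask" type="list-string">
      <!-- Match any IPv4 address whose 24 most significant bits match 207.46.197.0 -->
      <member> 207.46.197.0/24 </member>
   </attrib>
   <attrib name="ip6mask" type="list-string">
      <!-- Match any IPv6 address whose 60 most significant bits match this prefix -->
      <member> 2002:CF2E:C500:0:0:0:0:0/60 </member>
   </attrib>
</section>
```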

This section is a set of host name filters that specify which URIs to exclude from a crawl collection. An empty section will not match any host name.

See the table in include_domains for the attrib elements for this section.

<section name="exclude_domains">
   <attrib name="exact" type="list-string">
      <member> www.contoso.com </member>
      <member> www2.contoso.com </member>
   </attrib>
   <attrib name="prefix" type="list-string">
      <member> www </member>
   </attrib>
   <attrib name="suffix" type="list-string">
      <member> .contoso.com</member>
      <member> .contoso2.com</member>
   </attrib>
   <attrib name="regexp" type="list-string">
      <member> .*\.contoso\.com </member>
   </attrib>
   <attrib name="file" type="list-string">
       <member> c:\myexclude_domains.txt </member>
   </attrib>
</section>

This section is a set of URI-based rules that specify which URIs to include in a crawl collection. An empty section will match all URIs.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

exact

list-string

Specifies a list of URIs. If a URI exactly matches one of these URIs, the URI is included by this rule.

prefix

list-string

Specifies a list of strings. If a URI begins with one of these strings, the URI is included by this rule.

suffix

list-string

Specifies a list of strings. If a URI ends with one of these strings, the URI is included by this rule.

regexp

list-string

Specifies a list of regular expressions. If a URI matches one of these regular expressions, the URI is included by this rule.

<section name="include_uris">
   <attrib name="exact" type="list-string">
      <member> http://www.contoso.com/documents/doc2.html </member>
   </attrib>
   <attrib name="prefix" type="list-string">
      <member> http://www.contoso.com/documents/ </member>
   </attrib>
   <attrib name="suffix" type="list-string">
      <member> /doc2.html </member>
   </attrib>
   <attrib name="regexp" type="list-string">
      <member> http://.*\.contoso\.com/documents.*</member>
   </attrib>
   <attrib name="file" type="list-string">
       <member> c:\myinclude_uris.txt </member>
   </attrib>
</section>

This section is a set of URI-based rules that specify which URIs to exclude from a crawl collection. An empty section will not match any URIs.

See the table in include_uris for the attrib elements for this section.

<section name="exclude_uris">
   <attrib name="exact" type="list-string">
      <member> http://www.contoso.com/documents/doc2.html </member>
   </attrib>
   <attrib name="prefix" type="list-string">
      <member> http://www.contoso.com/documents/ </member>
   </attrib>
   <attrib name="suffix" type="list-string">
      <member> /doc2.html </member>
   </attrib>
   <attrib name="regexp" type="list-string">
      <member> http://.*\.contoso\.com/documents.*</member>
   </attrib>
   <attrib name="file" type="list-string">
       <member> c:\myexclude_uris.txt </member>
   </attrib>
</section>

This section specifies logging behavior for the Web crawler process.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

fetch

string

text|none

Enable/disable logging of downloaded Web items. Valid values are as follows:

  • text: This creates a text-formatted log.

  • none: This disables logging.

Default: text

postprocess

string

text|xml|none

Enable/disable logging of node scheduler item post processing. Valid values are as follows:

  • text: This creates a text-formatted log.

  • xml: This creates an XML formatted log.

  • none: This disables logging.

Default: text

header

string

text|none

Enable/disable logging of HTTP headers. Valid values are as follows:

  • text: This creates a text-formatted log.

  • none: This disables logging.

screened

string

text|none

Enable/disable logging of all screened URIs. Valid values are as follows:

  • text: This creates a text-formatted log.

  • none: This disables logging.

scheduler

string

text|none

Enable/disable logging of adaptive crawling. Valid values are as follows:

  • text: This creates a text-formatted log.

  • none: This disables logging.

dsfeed

string

text|none

Enable/disable the logging of content submission to the indexing engine. Valid values are as follows:

  • text: This creates a text-formatted log.

  • none: This disables logging.

site

string

text|none

Enable/disable logging per crawl site. Valid values are as follows:

  • text: This creates a text-formatted log.

  • none: This disables logging.

<section name="log">
   <attrib name="dsfeed" type="string"> text </attrib>
   <attrib name="fetch" type="string"> text </attrib>
   <attrib name="postprocess" type="string"> text </attrib>
   <attrib name="screened" type="string"> none </attrib>
   <attrib name="site" type="string"> text </attrib>
</section>
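The example above omits the header and scheduler logs described in the table. They accept the same text|none values as the other log attributes; a sketch (the chosen values are illustrative):

```xml
<section name="log">
   <!-- Do not log HTTP headers -->
   <attrib name="header" type="string"> none </attrib>
   <!-- Log adaptive crawling activity as text -->
   <attrib name="scheduler" type="string"> text </attrib>
</section>
```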

This section specifies how the Web crawler stores data and metadata.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

datastore

string

flatfile|bstore

Specifies the format for Web item content storage. Valid values are as follows:

  • flatfile: This format stores items directly into the file system.

  • bstore: This format partitions items into fixed sized blocks and distributes them across a set of files. An index maps the order of the blocks, and specifies which blocks belong to an item.

Default: bstore

store_http_header

boolean

yes|no

Specifies that the Web crawler should store the received HTTP header.

Default: yes

store_dupes

boolean

yes|no

Specifies that the Web crawler should store duplicate Web items.

Default: no

compress

boolean

yes|no

Specifies that downloaded items should be compressed before storing them.

Default: yes

compress_exclude_mime

list-string

Specifies a set of MIME types of Web items that should not be compressed when stored. Use this for Web items that are already compressed, for example, multimedia formats.

If the compress configuration parameter is disabled, this parameter has no effect.

remove_docs

boolean

yes|no

Specifies that the Web crawler should delete Web items from the Web crawler store as soon as they are submitted to the indexing engine. This reduces the disk space requirements of the Web crawler, but makes it impossible to refeed the content later.

Default: no

clusters

integer

<value>

Specifies the number of clusters to use for storage in a crawl collection. Web items are distributed among these storage clusters.

Default: 8

defrag_threshold

integer

<percentage>

A non-zero value that specifies the threshold value (of used capacity) before defragmenting a data storage file. When the used space is less than the defrag_threshold, the file is eligible for defragmentation to reclaim fragmented space caused by stored Web items. Database files are compacted regardless of fragmentation level.

The default of 85% means there must be 15% reclaimable space in the data storage file to trigger defragmentation.

A value of 0 disables defragmentation.

This setting is applicable only when the datastore attribute is set to bstore.

Default: 85

uri_dir

string

<path>

Specifies a directory path where the Web crawler stores files listing all hyperlinks extracted from Web items. Each site manager process uses a separate file. The name of a URI file is constructed by concatenating the process PID with .txt.

<section name="storage">
   <attrib name="store_dupes" type="boolean"> no </attrib>
   <attrib name="datastore" type="string"> bstore </attrib>
   <attrib name="compress" type="boolean"> yes </attrib>
</section>

This section specifies the post processing behavior for a node scheduler. Post processing consists of two primary tasks: feeding Web items to the index, and performing duplicate detection.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

use_dupservers

boolean

yes|no

Specifies that the Web crawler should use one or more duplicate servers.

This option is applicable only in a multi-node installation.

Default: no

max_dupes

integer

<value>

Specifies the maximum number of duplicates to record per Web item.

Default: 10

stripe

integer

<value>

Specifies the number of data files to distribute the checksum data into. Increasing this value may improve the performance of post processing.

Default: 1

ds_meta_info

list-string

duplicates|redirects|mirrors|metadata

Specifies the kind of metadata a node scheduler should report to the indexing engine. Valid values are as follows:

duplicates: Reports URIs that are duplicates of this item

redirects: Reports URIs that are redirected to this item

metadata: Reports meta data of this item

mirrors: Reports all mirror URIs of this Web item

ds_max_ecl

integer

<value>

Specifies the maximum number of duplicates or redirects to report to the indexing engine, as specified by the ds_meta_info configuration parameter.

Default: 10

ecl_override

string

Specifies a regular expression that identifies redirect and duplicate URIs that should be stored and possibly submitted to the indexing engine, even though max_dupes is reached. For example: .*index\.html$

ds_send_links

boolean

yes|no

Specifies whether all extracted hyperlinks from a Web item should be sent to the indexing engine.

ds_paused

boolean

yes|no

Specifies whether a node scheduler should suspend the submission of content to the indexing engine.

<section name="pp">
   <attrib name="max_dupes" type="integer"> 10 </attrib>
   <attrib name="use_dupservers" type="boolean"> yes </attrib>
   <attrib name="ds_paused" type="boolean"> no </attrib>
</section>
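The example above does not show ds_meta_info, which is a list-string. A hedged sketch of a pp section that also reports duplicate and redirect URIs to the indexing engine, capped by ds_max_ecl:

```xml
<section name="pp">
   <!-- Report duplicate and redirect URIs for each Web item -->
   <attrib name="ds_meta_info" type="list-string">
      <member> duplicates </member>
      <member> redirects </member>
   </attrib>
   <!-- Report at most 10 duplicates or redirects per item -->
   <attrib name="ds_max_ecl" type="integer"> 10 </attrib>
</section>
```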

This section specifies duplicate server settings.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

format

string

gigabase|hashlog|diskhashlog

Specifies the duplicate server database format. Valid values are as follows:

  • gigabase: Gigabase is a simple key-value database.

  • hashlog: Hashlog is an in-memory data structure consisting of a hash table and a data log. The data log contains all stored keys and values, and can automatically rebuild the in-memory hash table if it is needed.

  • diskhashlog: Diskhashlog is the same as hashlog except that the data structure is accessed directly on disk.

cachesize

integer

<megabytes>

Specifies the duplicate server database cache size in megabytes. If the format configuration parameter is set to hashlog or diskhashlog, this parameter specifies the initial size of the hash table.

stripes

integer

<value>

Specifies the number of data files to spread content to. By using multiple files, you can improve the performance of the duplicate server database.

compact

boolean

yes|no

Specifies whether the duplicate server database should perform compaction. For the hashlog and diskhashlog formats, compaction must be performed either manually with the crawlerdbtool or automatically by enabling this option. Otherwise, disk usage will increase for every record written or updated.

Default: yes

<section name="ppdup">
   <attrib name="format" type="string"> hashlog </attrib>
   <attrib name="stripes" type="integer"> 1 </attrib>
      <!-- 1 GB memory hash -->
   <attrib name="cachesize" type="integer"> 1024 </attrib>
   <attrib name="compact" type="boolean"> yes </attrib>
</section>

The feeding section consists of at least one section XML element that specifies how to send a representation of the crawl collection to the indexing engine. Such a section defines a content destination. The name attribute specifies a unique name for the content destination.

The following table specifies attrib elements for a content destination section.

 

Name Type Value Meaning

collection

string

<name>

Specifies the name of the content collection for submitting Web items. This configuration parameter must be specified in a feeding section.

destination

string

default

Reserved. This configuration parameter must contain the value default.

paused

boolean

yes|no

Specifies whether the Web crawler should suspend the submission of content to the indexing engine.

Default: no

primary

boolean

yes|no

Specifies whether this content destination is a primary or secondary content destination.

A primary content destination can act on callback information during content submission to the indexing engine.

If only one content destination is specified, it will be a primary destination.

<section name="feeding">
    <section name="Global_News">
        <attrib name="collection" type="string"> collection_A </attrib>
        <attrib name="destination" type="string"> default </attrib>
        <attrib name="primary" type="boolean"> yes </attrib>
        <attrib name="paused" type="boolean"> no </attrib>
    </section>
    <section name="Local_News">
        <attrib name="collection" type="string"> collection_B </attrib>
        <attrib name="destination" type="string"> default </attrib>
        <attrib name="primary" type="boolean"> no </attrib>
        <attrib name="paused" type="boolean"> no </attrib>
     </section>
</section>

This section configures the cache sizes for the Web crawler process.

The following table specifies attrib elements for this section.

noteNote
The default value for each attribute, if one is not specified in the table, is to have the Web crawler automatically determine the cache size at run time.

 

Name Type Value Meaning

duplicates

integer

<value that represents a number of items>

Specifies the size of the duplicate checksum cache, per site manager process. This cache is used as a first level of duplicate detection at run time.

screened

integer

<value that represents a number of items>

Specifies the size of the screened URI cache, as the number of hyperlinks. The screened cache filters out duplicate hyperlinks that recently resulted in retrieval failures.

smcomm

integer

<value that represents a number of items>

Specifies the size of the bloom filter used by the cache that filters out duplicate hyperlinks flowing between the node scheduler and the site managers.

mucomm

integer

<value that represents a number of items>

Specifies the size of the bloom filter used by the cache that filters out duplicate hyperlinks flowing between the multi-node scheduler and the node scheduler.

wqcache

integer

<value that represents a number of items>

Specifies the size of the cache that filters out duplicate hyperlinks from the Web site crawl queues.

crosslinks

integer

<value that represents a number of items>

Specifies the size of the crosslink cache. The crosslink cache contains retrieved hyperlinks and referring hyperlinks. It filters out duplicate hyperlinks in the node scheduler if mufilter is not enabled.

routetab

integer

<value>

Specifies the crawl routing database cache size, in bytes.

Default: 1048576

pp

integer

<value>

Specifies the post process database cache size, in bytes.

Default: 1048576

pp_pending

integer

<value>

Specifies the post process pending cache size, in bytes. The pending cache contains entries that were not sent to the duplicate servers.

Default: 131072

aliases

integer

<value>

Specifies the aliases data mapping database cache size, in bytes. A crawl site can be associated with one or more aliases (alternative host names).

Default: 1048576

<section name="cachesize">
      <!-- Specific cache size values (in number of items) for the following: -->
      <attrib name="duplicates" type="integer"> 128 </attrib>
      <attrib name="screened" type="integer"> 128 </attrib>
      <attrib name="smcomm" type="integer"> 128 </attrib>
      <attrib name="mucomm" type="integer"> 128 </attrib>
      <attrib name="wqcache" type="integer"> 4096 </attrib>
      <!-- Automatic cache size for crosslinks -->
      <attrib name="crosslinks" type="integer"> </attrib>
      <!-- Cache sizes in bytes for the following -->
      <attrib name="routetab" type="integer"> 1048576 </attrib>
      <attrib name="pp" type="integer"> 1048576 </attrib>
      <attrib name="pp_pending" type="integer"> 1048576 </attrib>
      <attrib name="aliases" type="integer"> 1048576 </attrib>
</section>

This section specifies how to handle HTTP/HTTPS error response codes and conditions.

The following table specifies attrib elements for this section. Because there are multiple values for the name attribute, a description of each purpose is included in the name column.

 

Name

Type

Value

Meaning

The name attribute specifies the HTTP/HTTPS/FTP response code number to handle. The character "x" can be used as a wildcard. For example: 4xx

Other valid values are as follows:

  • net: used to handle network socket errors

  • int: used to handle internal errors in the Web crawler

  • ttl: used to handle HTTP/HTTPS/FTP connection time-outs

string

<value>

Specifies how the Web crawler handles HTTP/HTTPS/FTP and network errors. Valid options for handling individual response codes are as follows:

  • KEEP: Keep the Web item unchanged

  • DELETE[:X]: Delete the Web item if the error condition occurs for the Xth time. Deletion happens immediately if no X value is specified.

If RETRY[:X] is appended to either of these options, the Web crawler re-downloads the Web item up to X times within the same crawl refresh cycle before treating the attempt as failed. Otherwise, the crawler does not try to download the URI again until the next crawl refresh cycle.

Default: See Default values for the http_errors section and Default values for the ftp_errors section.

The following table specifies the default values for the http_errors section.

 

Name Value Meaning

4xx

DELETE:0

Delete immediately.

5xx

DELETE:10

Delete the tenth time this error is encountered for this URI, usually after 10 crawl cycles. The counter is reset if the URI is successfully retrieved.

int

KEEP:0

Do not delete.

net

DELETE:3, RETRY:1

Delete the third time. One retry is specified. This means that the URI will be deleted on the next refresh cycle if it still cannot be retrieved.

ttl

DELETE:3

Delete the third time.

<section name="http_errors">
    <attrib name="408" type="string"> KEEP </attrib>
    <attrib name="4xx" type="string"> DELETE </attrib>
    <attrib name="5xx" type="string"> DELETE:10, RETRY:3 </attrib>
    <attrib name="ttl" type="string"> DELETE:3 </attrib>
    <attrib name="net" type="string"> DELETE:3 </attrib>
    <attrib name="int" type="string"> KEEP </attrib>
</section>

This section specifies how to handle response codes and error conditions for FTP URIs.

See the table in http_errors for the attrib elements for this section.

The following table specifies the default values for the ftp_errors section.

 

Name Value Meaning

4xx

DELETE:3

Delete the third time that this error is encountered for this URI, usually after 3 crawl cycles. The counter is reset if the URI is successfully retrieved.

550

DELETE:0

Delete immediately.

5xx

DELETE:3

Delete the third time, same as for 4xx.

int

KEEP:0

Do not delete.

net

DELETE:3, RETRY:1

Delete the third time. One retry is specified. This means that the URI will be deleted on the next refresh cycle if it still cannot be retrieved.

<section name="ftp_errors">
    <attrib name="4xx" type="string"> DELETE:3 </attrib>
    <attrib name="550" type="string"> DELETE:0 </attrib>
    <attrib name="5xx" type="string"> DELETE:3 </attrib>
    <attrib name="int" type="string"> KEEP:0 </attrib>
    <attrib name="net" type="string"> DELETE:3, RETRY:1 </attrib>
    <attrib name="ttl" type="string"> DELETE:3 </attrib>
</section>

This section specifies the priority levels for the crawl queues, and specifies the rules and modes used to insert URIs into and extract URIs from the queues.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

levels

integer

<value>

Specifies the number of priority levels used for the crawl queues.

Default: 1

default

integer

<value>

Specifies a default priority level that is assigned to URIs in a crawl queue.

Default: 1

start_uri_pri

integer

<value>

Specifies the priority level for start URIs. See the start_uris and the start_uri_files configuration parameters.

Default: 1

pop_scheme

string

default|rr|wrr|pri

Specifies the mode used by the Web crawler to extract URIs from the crawl queue. Valid values are as follows:

  • rr: This mode extracts URIs from the priority levels in round-robin order.

  • wrr: This mode extracts URIs from the priority levels in a weighted round-robin order. The weights are based on the share setting of each priority level, as specified in Priority level section.

  • pri: This mode extracts URIs from the priority levels in strict priority order, for as long as entries remain in the crawl queue at a given level. 1 is the highest priority, as specified in Priority level section.

  • default: This mode is the same as wrr.

Default: default

put_scheme

string

default|include

Specifies which Web crawler mode to use when you insert URIs into the crawl queue. Valid values are as follows:

  • default: This mode always inserts URIs with the priority level specified in the default configuration parameter.

  • include: This mode inserts URIs with the priority level whose include_domains or include_uris section they match, as specified in Priority level section. The Web crawler process assigns the default priority level when a URI does not match any of these sections.

Default: default

Within the workqueue_priority section, you can specify a set of sections that define the priority levels and weights of the crawl queues. These sections are used only if the pop_scheme parameter is set to wrr or pri. The name attribute of each of these sections must be the priority level that it defines. Priority levels must begin at 1. (See <section name="1"> in the following example.)

The include_domains or include_uris section can be used within each priority level section, as specified in include_domains and include_uris. URIs that match these rules will be queued using the matching priority level. In addition, the following table specifies attrib elements for these sections.

 

Name Type Value Meaning

share

integer

<value>

Specifies a weight to use for each crawl queue. This weight is used only if the pop_scheme configuration parameter is set to wrr.

<section name="workqueue_priority">
    <attrib name="levels" type="integer"> 2 </attrib>
    <attrib name="default" type="integer"> 2 </attrib>
    <attrib name="start_uri_pri" type="integer"> 1 </attrib>
    <attrib name="pop_scheme" type="string"> wrr </attrib>
    <attrib name="put_scheme" type="string"> include </attrib>
    <section name="1">
        <attrib name="share" type="integer"> 10 </attrib>
        <section name="include_domains">
            <attrib name="suffix" type="list-string">
                <member> web005.contoso.com  </member>
            </attrib>
        </section>
    </section>
    <section name="2">
        <attrib name="share" type="integer"> 5 </attrib>
        <section name="include_domains">
           <attrib name="suffix" type="list-string">
              <member> web002.contoso.com  </member>
           </attrib>
        </section>
    </section>
</section>

This section specifies which kind of hyperlinks to follow.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

a

boolean

yes|no

Extracts hyperlinks from <A/> HTML tags.

Default: yes

action

boolean

yes|no

Extracts hyperlinks from action attributes in HTML tags.

Default: yes

area

boolean

yes|no

Extracts hyperlinks from <AREA/> HTML tags.

Default: yes

card

boolean

yes|no

Extracts hyperlinks from the <CARD/> Wireless Markup Language tags.

Default: yes

comment

boolean

yes|no

Extracts hyperlinks from comments in a Web item.

Default: yes

embed

boolean

yes|no

Extracts hyperlinks from <EMBED/> HTML tags.

Default: yes

frame

boolean

yes|no

Extracts hyperlinks from <FRAME/> HTML tags.

Default: yes

go

boolean

yes|no

Extracts hyperlinks from <GO/> Wireless Markup Language tags.

Default: yes

img

boolean

yes|no

Extracts hyperlinks from <IMG/> HTML tags.

Default: no

layer

boolean

yes|no

Extracts hyperlinks from <LAYER/> HTML tags.

Default: yes

link

boolean

yes|no

Extracts hyperlinks from <LINK/> HTML tags.

Default: yes

meta

boolean

yes|no

Extracts hyperlinks from <META/> HTML tags.

Default: yes

meta_refresh

boolean

yes|no

Extracts hyperlinks from meta refresh HTML tags (<meta http-equiv="refresh" content="n" />).

Default: yes

object

boolean

yes|no

Extracts hyperlinks from <OBJECT/> HTML tags.

Default: yes

script

boolean

yes|no

Extracts hyperlinks from <SCRIPT/> HTML tags.

Default: yes

script_java

boolean

yes|no

Extracts hyperlinks from <SCRIPT/> HTML tags that contain JavaScript.

Default: yes

style

boolean

yes|no

Extracts hyperlinks from <STYLE/> HTML tags.

Default: yes

<section name="link_extraction">
   <attrib name="action" type="boolean"> yes </attrib>
   <attrib name="img" type="boolean"> no </attrib>
   <attrib name="link" type="boolean"> yes </attrib>
   <attrib name="meta" type="boolean"> yes </attrib>
   <attrib name="meta_refresh" type="boolean"> yes </attrib>
   <attrib name="object" type="boolean"> yes </attrib>
   <attrib name="script_java" type="boolean"> yes </attrib>
</section>

The limits section specifies fail-safe limits for a crawl collection. When the collection exceeds the limit, it enters a "refresh only" crawl mode. This means that only previously-crawled URIs are crawled again.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

disk_free

integer

<percentage>

Specifies the percentage of free disk space that must be available for the Web crawler to operate in normal crawl mode (specified in the crawlmode section). If the free disk space falls below this limit, the Web crawler enters the "refresh only" crawl mode.

If the parameter is set to 0, this feature is disabled.

Default: 0

disk_free_slack

integer

<percentage>

Specifies slack for the disk_free threshold, as a percentage.

This option creates a buffer zone around the disk_free threshold. When the free disk space is within this buffer, the Web crawler does not change the crawl mode back to normal. This prevents the Web crawler from switching back and forth between crawl modes when the percentage of free disk space is close to the value specified by the disk_free parameter. When the free disk space percentage exceeds disk_free + disk_free_slack, normal crawling resumes.

Default: 3

max_doc

integer

<value>

Specifies the number of stored Web items that causes the crawler to enter the "refresh only" crawl mode.

noteNote
The threshold is not an exact limit, because statistical reporting is somewhat delayed compared to crawling.

When set to 0, this feature is disabled.

Default: 0

max_doc_slack

integer

<value>

To avoid the crawler constantly entering and exiting the "refresh only" crawl mode, you can specify a threshold range in addition to the absolute threshold value. Within the range from (threshold minus slack) to (threshold), the crawl mode behavior remains unchanged. The max_doc_slack attribute specifies this slack, as a number of items below the max_doc configuration parameter threshold.

Default: 1000

<section name="limits">
   <attrib name="disk_free" type="integer"> 0 </attrib>
   <attrib name="disk_free_slack" type="integer"> 3 </attrib>
   <attrib name="max_doc" type="integer"> 0 </attrib>
   <attrib name="max_doc_slack" type="integer"> 1000 </attrib>
</section>

This section configures focused scheduling. An exclude_domains section can be used within the focused section to exclude host names from this focused scheduling. If no exclude_domains section is defined, all host names are included in the focused scheduling.

The following table specifies attrib elements for this section.

 

Name

Type

Value

Meaning

languages

list-string

Lists the languages for items that can be stored by the Web crawler, as specified in ISO-639-1.

depth

integer

<value>

Specifies the number of page hops to follow for Web items that do not match the specified languages, as set by the languages configuration parameter.

In the following example, the crawler will store all items with Norwegian, English, or unknown language content. For all non-specified languages, the crawler will only follow links to 2 levels. In addition, all content under contoso.com is excluded from the language checks and is automatically stored.

<section name="focused">
   <!-- Crawl Norwegian, English and content of unknown language -->
   <attrib name="languages" type="list-string">
      <member> norwegian </member>
      <member> unknown </member>
      <member> en </member>
   </attrib>
   <!--Follow hyperlinks containing other languages for 2 levels -->
   <attrib name="depth" type="integer"> 2 </attrib>
   <!-- Exclude anything under .contoso.com from language checks -->
   <section name="exclude_domains">
      <attrib name="suffix" type="list-string">
         <member> .contoso.com </member>
      </attrib>
   </section>
</section>


This section configures credentials for Web sites that require authentication. The Web crawler supports basic authentication, digest authentication, and NTLM authentication.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

name

string

The name attribute must contain a URI or realm. A valid URI behaves as a prefix value, because all hyperlinks extracted at its level or deeper use these authentication settings.

The credentials must be specified in one of the following formats: username:password or username:password:realm:scheme.

The password component of the credential string can be encrypted; if it is not encrypted, it is given in plaintext.

An encrypted password is created by using the crawleradmin tool with the -e option. The encryption algorithm that is used is Advanced Encryption Standard (AES-128), with the key found in <FASTSearchFolder>\etc\CrawlerEncryptionKey.dat.

If the credentials are given using the username:password format, the Web crawler automatically uses basic access authentication. Otherwise the configuration must specify an authentication scheme. Valid authentication schemes are as follows:

  • basic

  • digest

  • ntlmv1

  • ntlmv2

  • auto: Specifies that the Web crawler determines the authentication scheme by itself.

<section name="passwd">
    <attrib name="http://www.contoso.com/confidential1/" type="string">
      user:password:contoso:auto
    </attrib>
</section>
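
For comparison, here is a minimal sketch of the simpler username:password credential format, which implies basic access authentication. The host name and credentials are placeholders:

```xml
<section name="passwd">
    <!-- Exact URI/prefix match; basic authentication is implied -->
    <attrib name="http://intranet.contoso.com/protected/" type="string">
        user:password
    </attrib>
</section>
```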

This section specifies FTP accounts for crawling FTP URIs.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

name

string

The value of the name XML attribute is the host name for which this FTP account is valid.

This is the user name and password for this FTP account. The string must be in the format: username:password

<section name="ftp_acct">
   <attrib name="ftp.contoso.com" type="string"> user:pass </attrib>
</section>

This section is used to exclude Web items from the crawl, based on the contents of the HTTP header fields.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

name

The name attribute is used to set the name of the HTTP header to test.

list-string

Specifies a list of regular expressions. If the value of the specified HTTP header matches one of these regular expressions, the Web item is excluded from the crawl.

<section name="exclude_headers">
   <attrib name="Header Name" type="list-string">
      <member> .*excluded.*value </member>
   </attrib>
</section>

This section specifies time slots that use a different request rate. When no time slot is specified, the crawler uses the delay configuration parameter as specified in attrib.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

name in the format: DDD:HH.MM-DDD:HH.MM

string

<value in seconds> | suspend

Specifies the request delay for this time slot, in seconds. The value suspend specifies that crawling of this crawl collection is suspended during the time slot.

The following example shows how the Web crawler uses different delay intervals during the week. On Wednesday between 9:00 a.m. and 7:00 p.m. the Web crawler uses a delay of 20 seconds. On Monday between 9:00 a.m. and 5:00 p.m. the crawler suspends crawling, and at any other time of the week the Web crawler uses a delay of 60 seconds.

<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
   <DomainSpecification name="variable_example">
      <section name="variable_delay">
         <attrib name="Wed:09-Wed:19" type="string">20 </attrib>
         <attrib name="Mon:09-Mon:17" type="string">suspend</attrib>
      </section>
   </DomainSpecification>
</CrawlerConfig>
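
The DDD:HH.MM-DDD:HH.MM time-slot notation itself can be illustrated with a small sketch. This is only an illustration of the format, not the crawler's implementation, and the helper names are hypothetical:

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def parse_endpoint(endpoint):
    # "Wed:09" or "Wed:09.30" -> (weekday index, hour, minute)
    day, _, clock = endpoint.partition(":")
    hour, _, minute = clock.partition(".")
    return DAYS.index(day), int(hour), int(minute or 0)

def in_slot(slot, moment):
    """Return True if moment falls inside a DDD:HH.MM-DDD:HH.MM slot."""
    start, end = (parse_endpoint(part) for part in slot.split("-"))
    now = (moment.weekday(), moment.hour, moment.minute)
    return start <= now < end

in_slot("Wed:09-Wed:19", datetime(2011, 8, 3, 12, 0))  # Wednesday noon: True
in_slot("Mon:09-Mon:17", datetime(2011, 8, 6, 12, 0))  # Saturday: False
```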

This section specifies the adaptive crawling options. The refresh_mode configuration parameter, specified in attrib, must be set to adaptive for this section to be used by the Web crawler.

The adaptive crawling behavior can be controlled with the weights and the sitemap_weights sections.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

refresh_count

integer

<value>

Specifies the number of minor refresh cycles. A refresh cycle is divided into several fixed-size time intervals called minor refresh cycles.

Default: 4

refresh_quota

integer

<percentage>

Specifies the ratio of existing re-crawled URIs to new unseen URIs, expressed as a percentage. Setting the percentage low gives preference to new URIs.

Default: 90

coverage_min

integer

<value>

Specifies a minimum number of URIs to crawl per Web site in a minor refresh cycle. Used to guarantee some coverage for small Web sites.

Default: 25

coverage_max_pct

integer

<value>

Specifies the maximum percentage of a Web site to re-crawl in a minor cycle. This prevents small Web sites from being fully crawled in each minor cycle at the expense of larger Web sites.

Default: 10

        <section name="adaptive">
            <attrib name="refresh_count" type="integer"> 4 </attrib>
            <attrib name="refresh_quota" type="integer"> 98 </attrib>
            <attrib name="coverage_max_pct" type="integer"> 25 </attrib>
            <attrib name="coverage_min" type="integer"> 10 </attrib>

            <!-- Ranking weights. Each scoring criteria adds a score between -->
            <!-- 0.0 and 1.0 which is then multiplied with the associated    -->
            <!-- weight below. Use a weight of 0 to disable a scorer         --> 
        
           <section name="weights">
                <attrib name="inverse_length" type="real"> 1.0 </attrib>
                <attrib name="inverse_depth" type="real"> 1.0 </attrib>
                <attrib name="is_landing_page" type="real"> 1.0 </attrib>
                <attrib name="is_mime_markup" type="real"> 1.0 </attrib>
                <attrib name="change_history" type="real"> 10.0 </attrib>
            </section>
        </section>

In this section, each URI is given a score in the adaptive crawling process. The score, based on a set of rules, is used to prioritize URIs. Each rule is assigned a weight, specified in the weights section, that determines its contribution to the total score.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

inverse_length

real

<value>

Specifies the weight for the inverse length rule. The inverse length rule gives URIs with few path segments (defined by the number of forward slashes) a higher score. URIs with 10 or more slashes receive a score of 0.

Default: 1.0

inverse_depth

real

<value>

Specifies the weight for the inverse depth rule. The number of page hops from a start URI is computed; a high score is assigned to URIs that are fewer than 10 page hops away. The rule gives a score of zero for URIs with 10 or more page hops.

Default: 1.0

is_landing_page

real

<value>

Specifies the weight for the is_landing_page rule. This rule gives a URI that is considered a landing page a higher score. A landing page is a URI whose path ends in /, index.html, index.htm, index.php, index.jsp, index.asp, default.html, or default.htm.

The rule gives no score to URIs that have query components.

Default: 1.0

is_mime_markup

real

<value>

Specifies the weight for the is_mime_markup rule. This rule gives an additional score to pages whose MIME type is specified in the uri_search_mime configuration parameter in attrib.

Default: 1.0

change_history

real

<value>

Specifies the weight for the change history rule. This rule scores based on the HTTP header "last-modified" value over time. Web items that frequently change have a higher score than items that change less frequently.

Default: 10.0

sitemap

real

<value>

Specifies the weight for the sitemap rule. The score for the sitemap rule is specified in sitemap_weights.

Default: 10.0

            <!-- Ranking weights. Each scoring criteria adds a score between -->
            <!-- 0.0 and 1.0 which is then multiplied with the associated    -->
            <!-- weight below. Use a weight of 0 to disable a scorer         -->
            <section name="weights">
                <!-- Score based on the number of /'es (segments) in the -->
                <!-- URI. Max score with one, no score with 10 or more   -->
                <attrib name="inverse_length" type="real"> 1.0 </attrib>

                <!-- Score based on the number of link "levels" down to -->
                <!-- this URI. Max score with none, no score with >= 10 -->
                <attrib name="inverse_depth" type="real"> 1.0 </attrib>

                <!-- Score added if URI is determined as a "landing page", -->
                <!-- defined as e.g. ending in "/" or "index.html". URIs   -->
                <!-- with query parameters are not given score             -->
                <attrib name="is_landing_page" type="real"> 1.0 </attrib>

                <!-- Score added if URI points to a markup document as    -->
                <!-- defined by the "uri_search_mime" option. Assumption  -->
                <!-- being that such content changes more often than e.g. -->
                <!-- "static" Word or PDF documents.                      -->
                <attrib name="is_mime_markup" type="real"> 1.0 </attrib>

                <!-- Score based on change history tracked over time by   -->
                <!-- using an estimator based on last modified date given -->
                <!-- by the web server. If no modified date returned then -->
                <!-- one is estimated (based on whether the document has  -->
                <!-- changed or not).                                     -->
                <attrib name="change_history" type="real"> 10.0 </attrib>

                <!-- Score from the sitemap changefreq value, as mapped -->
                <!-- by the sitemap_weights section                     -->
                <attrib name="sitemap" type="real"> 10.0 </attrib>
            </section>
  

This section maps sitemap changefreq values to numeric weights for adaptive crawling. <URL> entries in a sitemap can contain a changefreq element, which indicates how frequently the URI is expected to change.

Valid string values for this element are as follows: always, hourly, daily, weekly, monthly, yearly, and never. The sitemap_weights section specifies a mapping of each string value to a numeric weight, which determines the contribution to the sitemap score in the weights section.

The adaptive crawling score contribution for a URI is calculated by multiplying this numeric weight by the weight of the sitemap configuration parameter.
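
As a worked sketch of this calculation, using the default weights from the table below and the default sitemap rule weight of 10.0 (the function name is hypothetical):

```python
# Sketch of the sitemap score contribution, assuming the default
# sitemap_weights mapping and the default sitemap rule weight (10.0).
SITEMAP_WEIGHTS = {
    "always": 1.0, "hourly": 0.64, "daily": 0.32, "weekly": 0.16,
    "monthly": 0.08, "yearly": 0.04, "never": 0.0, "default": 0.16,
}
SITEMAP_RULE_WEIGHT = 10.0  # the "sitemap" attrib in the weights section

def sitemap_score(changefreq):
    """Numeric weight for the changefreq value times the sitemap rule weight."""
    weight = SITEMAP_WEIGHTS.get(changefreq, SITEMAP_WEIGHTS["default"])
    return weight * SITEMAP_RULE_WEIGHT

sitemap_score("daily")  # 0.32 * 10.0 = 3.2
sitemap_score(None)     # no changefreq: default 0.16 * 10.0 = 1.6
```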

The following table specifies attrib elements for this section.

importantImportant
The range of these elements must be from 0.0 to 1.0.

 

Name Type Value Meaning

always

real

<value>

Specifies the weight of the changefreq value always as a numeric value.

Default: 1.0

hourly

real

<value>

Specifies the weight of the changefreq value hourly as a numeric value.

Default: 0.64

daily

real

<value>

Specifies the weight of the changefreq value daily as a numeric value.

Default: 0.32

weekly

real

<value>

Specifies the weight of the changefreq value weekly as a numeric value.

Default: 0.16

monthly

real

<value>

Specifies the weight of the changefreq value monthly as a numeric value.

Default: 0.08

yearly

real

<value>

Specifies the weight of the changefreq value yearly as a numeric value.

Default: 0.04

never

real

<value>

Specifies the weight of the changefreq value never as a numeric value.

Default: 0.0

default

real

<value>

Specifies the weight for all URIs that are not associated with a <changefreq> value.

Default: 0.16

<section name="sitemap_weights">
    <attrib name="always" type="real"> 1.0 </attrib>
    <attrib name="hourly" type="real"> 0.64 </attrib>
    <attrib name="daily" type="real"> 0.32 </attrib>
    <attrib name="weekly" type="real"> 0.16 </attrib>
    <attrib name="monthly" type="real"> 0.08 </attrib>
    <attrib name="yearly" type="real"> 0.04 </attrib>
    <attrib name="never" type="real"> 0.0 </attrib>
    <attrib name="default" type="real"> 0.16 </attrib>
</section>

This section specifies configuration parameters that override how the crawler routes host names to node schedulers. It ensures that a group of host names is routed to the same node scheduler and site manager. This is useful when the use_cookies setting is enabled, because cookies are global only within a site manager process. Also, if you know certain Web sites are closely interlinked, you can reduce internal communication by clustering their host names.

The following table specifies attrib elements for this section.

 

Name Type value Meaning

name

list-string

Specifies a list of host names that should be aggregated to a node scheduler.

<section name="site_clusters">
    <attrib name="mycluster" type="list-string">
        <member> host1.contoso.com </member>
        <member> host2.contoso.com </member>
        <member> host3.contoso.com </member>
    </attrib>
</section>

This section limits the span of a crawl collection.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

mode

string

Specifies the depth of the crawling. Valid values are FULL or DEPTH:#, where # is the number of page hops from a start URI.

Default: FULL

fwdlinks

boolean

yes|no

Specifies whether to follow hyperlinks that point to a different host name.

Default: yes

fwdredirects

boolean

yes|no

Specifies whether to follow external HTTP redirects received from servers. External redirects are HTTP redirects that point from one host name to another host name.

Default: no

reset_level

boolean

yes|no

Specifies whether to reset the page hop counter used by mode when following a hyperlink to another host name.

Default: yes

        <section name="crawlmode">
            <attrib name="mode" type="string"> DEPTH:1 </attrib>
            <attrib name="fwdlinks" type="boolean"> yes </attrib>
            <attrib name="fwdredirects" type="boolean"> yes </attrib>
            <attrib name="reset_level" type="boolean"> no </attrib>
        </section>

This section specifies content to submit in HTTP POST requests. The content is submitted to URIs that match a URI prefix or that exactly match a URI.

The following table specifies attrib elements for this section.

 

Name

Type

Value

Meaning

name

string

Specifies the payload content string. This string is posted to URIs that match the URI or prefix set by the name XML attribute.

If the name attribute specifies a plain URI, an exact match is required.

To specify a URI prefix, use the label prefix:; the remainder of the name is then matched against the leading part of each URI.

<section name="post_payload">
    <attrib name="prefix:http://www.contoso.com/secure" type="string"> variable1=value1&amp;variableB=valueB </attrib>
</section>
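
Note that the example payload escapes the & separator as &amp; because the value is embedded in XML. A minimal sketch of building such a payload with Python's standard library (the variable names are illustrative only):

```python
from urllib.parse import urlencode

# Build a POST payload string like the one in the example above.
payload = urlencode({"variable1": "value1", "variableB": "valueB"})
# "variable1=value1&variableB=valueB"

# When embedding the payload in the XML configuration, the "&"
# separator must be escaped as "&amp;".
xml_safe = payload.replace("&", "&amp;")
```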

This section initializes and configures RSS feed support in a crawl collection.

The following table specifies attrib elements for this section.

 

Name

Type

Value

Meaning

start_uris

list-string

Specifies a list of start URIs that point to RSS feed items.

start_uri_files

list-string

Specifies a list of paths to files that contain URIs pointing to RSS feed items. These files must be plain text, with one URI per line.

auto_discover

boolean

yes|no

Specifies whether the Web crawler should discover new RSS feeds. If this option is not set, only feeds specified in the RSS start URIs and the RSS start URIs files sections will be treated as RSS feeds.

Default: no

follow_links

boolean

yes|no

Specifies that the Web crawler should follow hyperlinks from Web items found in the RSS feed, which is the usual Web crawler behavior. If disabled, crawling happens only one hop away from a feed. Disable this option to only crawl feeds and Web items referenced by feeds.

Default: yes

ignore_rules

boolean

yes|no

Specifies that the Web crawler should crawl all Web items referenced by the RSS feed, regardless of their inclusion in the include/exclude rules, as specified in include_domains, exclude_domains, include_uris, and exclude_uris.

Default: no

index_feed

boolean

yes|no

Specifies whether the Web crawler should send RSS feeds themselves to the indexing engine, or only the Web items hyperlinked within the feeds.

Default: no

del_expired_links

boolean

yes|no

Specifies whether the Web crawler should delete items from the RSS feed when they expire, as defined by max_link_age and max_link_count.

Default: no

max_link_age

integer

<value>

Specifies the maximum age, in minutes, for a Web item found in an RSS feed. Only applies if the del_expired_links configuration parameter is set to yes.

Default: 0

max_link_count

integer

<value>

Specifies the maximum number of hyperlinks the Web crawler saves for an RSS feed. If the Web crawler encounters more hyperlinks, they expire in first-in-first-out order. Only applies if the del_expired_links configuration parameter is set to yes.

Default: 128

        <section name="rss">
            <!-- Attempt to discover new rss feeds, yes/no                  -->
            <attrib name="auto_discover" type="boolean"> yes </attrib>
            <attrib name="del_expired_links" type="boolean"> yes </attrib>
            <attrib name="follow_links" type="boolean"> yes </attrib>
            <attrib name="ignore_rules" type="boolean"> no </attrib>
            <attrib name="index_feed" type="boolean"> no </attrib>
            <attrib name="max_link_age" type="integer"> 0 </attrib>
            <attrib name="max_link_count" type="integer"> 128 </attrib>
            <attrib name="start_uris" type="list-string">
                <member> http://www.startsiden.no/rss.rss </member>
            </attrib>
            <!-- Start uri files (optional)                                 -->
            <attrib name="start_uri_files" type="list-string">
                <member> /usr/fast/etc/rss_seedlist.txt </member>
            </attrib>
        </section>
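
The first-in-first-out expiry controlled by max_link_count can be illustrated with a short sketch; this models the policy only, not the crawler's implementation:

```python
from collections import deque

# Keep at most max_link_count hyperlinks per feed; when a new link
# arrives and the buffer is full, the oldest link expires first.
max_link_count = 4  # the real default is 128
links = deque(maxlen=max_link_count)
for uri in ["/a", "/b", "/c", "/d", "/e"]:
    links.append(uri)

list(links)  # ["/b", "/c", "/d", "/e"] -- "/a" expired first
```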

This section configures HTML forms-based authentication. It must contain at least one section element; each is associated with a specific Web site login and must contain a unique login name in the name attribute.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

preload

string

<value>

Specifies the full URI of the page to retrieve before processing the login form.

scheme

string

http|https

Specifies the URI scheme of the login Web site.

Valid values: http or https

site

string

<value>

Specifies the host name of the login form page.

form

string

<value>

Specifies the path of the login form.

action

string

GET|POST

Specifies whether the form uses HTTP POST or HTTP GET.

Valid values are as follows: GET or POST

sites

list-string

<value>

Specifies a list of Web sites or host names that the Web crawler should log on to before it begins the crawl process.

ttl

integer

<seconds>

Specifies the time, in seconds, that can elapse before requiring another login to continue the crawl.

html_form

string

<value>

Specifies the URI to the HTML page that contains the login form.

autofill

boolean

yes|no

Specifies whether the Web crawler should try to automatically fill out the HTML login form. The html_form configuration parameter must be specified if this attribute is set to yes.

relogin_if_failed

boolean

yes|no

Specifies whether the Web crawler can attempt to re-log on to the Web site after ttl seconds if the login failed.

You can use Login elements as an alternative to the logins section.

        <section name="logins">
            <section name="mytestlogin">
                <!-- Instructs the crawler to "preload" potential cookies by -->
                <!-- fetching this page and register any cookies before      -->
                <!-- proceeding with login                                   -->
                <attrib name="preload" type="string">http://preload.contoso.com/</attrib>
                <attrib name="scheme" type="string"> https </attrib>
                <attrib name="site" type="string"> login.contoso.com </attrib>
                <attrib name="form" type="string"> /path/to/some/form.cgi </attrib> 
                <attrib name="action" type="string">POST</attrib> 
                <section name="parameters"> 
                    <attrib name="user" type="string"> username </attrib>
                    <attrib name="password" type="string"> password </attrib>
                    <attrib name="target" type="string"> sometarget </attrib>
                </section> 
                <!-- Host names of sites requiring this login to crawl -->
                <attrib name="sites" type="list-string"> 
                    <member> site1.contoso.com </member> 
                    <member> site2.contoso.com </member> 
                </attrib> 
                <!-- Time to live for login cookie. Will re-log in when expires -->
                <attrib name="ttl" type="integer"> 7200 </attrib> 
            </section>
        </section>

This section sets the authentication credentials that are used in an HTML form. It must be specified in a site logins section, or in a Login element. The credential parameters are typically different for each HTML form.

If the autofill configuration parameter is enabled, specify only the variables that are visible in the browser: for example, username and password or equivalent. In this case, the Web crawler retrieves the HTML page and reads any "hidden" variables that are required to submit the form. A variable value specified in the configuration parameters overrides any value stored in the form.

The following table specifies attrib elements for this section.

 

Name Type Value Meaning

name

The name XML attribute contains the name of the HTML form variable to set.

string

Specifies the value of the HTML form variable.

<section name="parameters">
    <attrib name="user" type="string"> username </attrib>
    <attrib name="password" type="string"> password </attrib>
    <attrib name="target" type="string"> sometarget </attrib>
</section>

This section specifies the configuration of crawl sub collections. The subdomains section must contain at least one section XML element, each of which specifies a crawl sub collection. A crawl sub collection section must contain a unique name, set by the name attribute.

Instead of a subdomains section, you can use a SubDomain element.

You must specify include/exclude rules to limit the scope of a crawl sub collection. These include/exclude rules are as follows: include_domains, exclude_domains, include_uris and exclude_uris.

Only a subset of the configuration parameters specified in attrib can be used in a crawl sub collection. These configuration parameters are as follows:

accept_compression

allowed_schemes

crawlmode

cut_off

delay

ftp_passive

headers

max_doc

proxy

refresh

refresh_mode

start_uris

start_uri_files

use_http_1_1

use_javascript

use_sitemaps

The refresh configuration parameters of a crawl sub collection must be set lower than the refresh rate of the main crawl collection. The use_javascript, use_sitemaps, and max_doc configuration parameters cannot be used if the include_uris or exclude_uris settings are used to specify the crawl sub collection.

In addition, you can use the rss and the variable_delay sections in a crawl sub collection.

<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
   <DomainSpecification name="subcollection_example">
      <section name="subdomains">
         <section name="subdomain_1">
            <section name="include_uris">
               <attrib name="prefix" type="list-string">
                  <member> http://www.contoso.com/index </member>
               </attrib>
            </section>
            <attrib name="refresh" type="real"> 60.0 </attrib>
            <attrib name="delay" type="real"> 10.0 </attrib>
            <attrib name="start_uris" type="list-string">
               <member> http://www.contoso.com/ </member>
            </attrib>
         </section>
      </section>
   </DomainSpecification>
</CrawlerConfig>

This element specifies the configuration of crawl sub collections. A crawl sub collection is an object that differentiates crawl collection members from one another by their definitions. A crawl collection can contain multiple SubDomain elements.

The configuration parameters for a SubDomain element are specified in subdomains.

A SubDomain element contains attrib elements and section elements.

 

Attribute Value Meaning

name

<name>

A string specifying the name of the crawl sub collection.

<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
   <DomainSpecification name="subcollection_example">
      <SubDomain name="subdomain_1">
         <section name="include_uris">
            <attrib name="prefix" type="list-string">
               <member> http://www.contoso.com/index </member>
            </attrib>
         </section>
         <attrib name="refresh" type="real"> 60.0 </attrib>
         <attrib name="delay" type="real"> 10.0 </attrib>
         <attrib name="start_uris" type="list-string">
            <member> http://www.contoso.com/ </member>
         </attrib>
      </SubDomain>
   </DomainSpecification>
</CrawlerConfig>

This element is used for HTML forms-based authentication. The configuration parameters for a Login element are specified in logins. A crawl collection can contain multiple Login elements. A Login element contains attrib elements and section elements.

 

Attribute Value Meaning

name

<value>

A string specifying the name of the login specification.

<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
   <DomainSpecification name="login_example">
      <Login name="mytestlogin">
         <attrib name="preload" type="string">http://preload.contoso.com/
         </attrib>
         <attrib name="scheme" type="string"> https </attrib>
         <attrib name="site" type="string"> login.contoso.com  </attrib>
         <attrib name="form" type="string"> /path/to/some/form.cgi </attrib>
         <attrib name="action" type="string">POST</attrib>
         <section name="parameters">
            <attrib name="user" type="string"> username </attrib>
            <attrib name="password" type="string"> password </attrib>
         </section>
         <attrib name="sites" type="list-string">
            <member> site1.contoso.com  </member>
            <member> site2.contoso.com  </member>
         </attrib>
         <attrib name="ttl" type="integer"> 7200 </attrib>
         <attrib name="html_form" type="string">
            http://login.contoso.com/login.html 
         </attrib>
         <attrib name="autofill" type="boolean"> yes </attrib>
         <attrib name="relogin_if_failed" type="boolean"> yes </attrib>
      </Login>
   </DomainSpecification>
</CrawlerConfig>

This element is used to override configuration parameters in a crawl collection or a crawl sub collection for a particular node scheduler. The configuration parameters for a Node element are specified in SubDomain, Login, attrib and section.

A Node element contains attrib elements and section elements.

 

Attribute Value Meaning

name

<value>

A string specifying the node scheduler for these configuration parameters.

The following example assumes a multi-node installation in which one of the node schedulers is named "crawler_node1". It configures "crawler_node1" with a different delay configuration parameter than the other nodes.

<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
   <DomainSpecification name="node_example">
      <attrib name="delay" type="real"> 60.0 </attrib>
      <Node name="crawler_node1">
         <attrib name="delay" type="real"> 90.0 </attrib>
      </Node>
   </DomainSpecification>
</CrawlerConfig>

A Web crawler configuration file must be formatted according to the following XML schema:

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="CrawlerConfig" type="CT_CrawlerConfig"/>
  
  <xs:complexType name="CT_CrawlerConfig">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="DomainSpecification" type="CT_DomainSpecification"/>
    </xs:choice>
  </xs:complexType>

  <xs:complexType name="CT_DomainSpecification">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="attrib" type="CT_attrib" maxOccurs="unbounded"/>
      <xs:element name="section" type="CT_section"/>
      <xs:element name="SubDomain" type="CT_SubDomain"/>
      <xs:element name="Login" type="CT_Login"/>
      <xs:element name="Node" type="CT_Node"/>
    </xs:choice>
    <xs:attribute name="name" type="xs:string" use="required"/>
  </xs:complexType>

  <xs:complexType name="CT_attrib" mixed="true">
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:element name="member" type="ST_member"/>
    </xs:sequence>
    <xs:attribute name="name" type="xs:string" use="required"/>
    <xs:attribute name="type" type="ST_type" use="required"/>
  </xs:complexType>

  <xs:complexType name="CT_section">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="attrib" type="CT_attrib"/>
        <xs:element name="section" type="CT_section"/>
    </xs:choice>
    <xs:attribute name="name" type="xs:string" use="required"/>
  </xs:complexType>

  <xs:complexType name="CT_SubDomain">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="attrib" type="CT_attrib"/>
      <xs:element name="section" type="CT_section"/>
    </xs:choice>
    <xs:attribute name="name" type="xs:string" use="required"/>
  </xs:complexType>

  <xs:complexType name="CT_Login">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="attrib" type="CT_attrib"/>
      <xs:element name="section" type="CT_section"/>
    </xs:choice>
    <xs:attribute name="name" type="xs:string" use="required"/>
  </xs:complexType>

  <xs:complexType name="CT_Node">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="attrib" type="CT_attrib"/>
      <xs:element name="section" type="CT_section"/>
    </xs:choice>
    <xs:attribute name="name" type="xs:string" use="required"/>
  </xs:complexType>
  
  <xs:simpleType name="ST_type">
    <xs:restriction base="xs:string">
      <xs:enumeration value="boolean"/>
      <xs:enumeration value="string"/>
      <xs:enumeration value="integer"/>
      <xs:enumeration value="list-string"/>
      <xs:enumeration value="real"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="ST_member">
    <xs:restriction base="xs:string"></xs:restriction>
  </xs:simpleType>
</xs:schema>

The following example shows a simple Web crawler configuration that crawls only the contoso.com Web site.

<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
    <DomainSpecification name="default_example">
        <section name="crawlmode">
            <attrib name="fwdlinks" type="boolean"> no </attrib>
            <attrib name="fwdredirects" type="boolean"> no </attrib>
            <attrib name="mode" type="string"> FULL </attrib>
            <attrib name="reset_level" type="boolean"> no </attrib>
        </section>
        <attrib name="start_uris" type="list-string">
            <member> http://www.contoso.com </member>
        </attrib>
    </DomainSpecification>
</CrawlerConfig>
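
Because the configuration is plain XML, it is easy to inspect programmatically. A minimal sketch, assuming Python's standard xml.etree module, that extracts the start URIs from a configuration like the one above:

```python
import xml.etree.ElementTree as ET

config = """
<CrawlerConfig>
    <DomainSpecification name="default_example">
        <attrib name="start_uris" type="list-string">
            <member> http://www.contoso.com </member>
        </attrib>
    </DomainSpecification>
</CrawlerConfig>
"""

root = ET.fromstring(config)
start_uris = [
    member.text.strip()
    for attrib in root.iter("attrib")
    if attrib.get("name") == "start_uris"
    for member in attrib.iter("member")
]
# start_uris == ["http://www.contoso.com"]
```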

The following example crawler configuration contains some common configuration parameters.

<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
    <DomainSpecification name="default_example">
        <attrib name="accept_compression" type="boolean"> yes </attrib>
        <attrib name="allowed_schemes" type="list-string">
            <member> http </member>
            <member> https </member>
        </attrib>
        <attrib name="allowed_types" type="list-string">
            <member> text/html </member>
            <member> text/plain </member>
        </attrib>
        <section name="cachesize">
            <attrib name="aliases" type="integer"> 1048576 </attrib>
            <attrib name="pp" type="integer"> 1048576 </attrib>
            <attrib name="pp_pending" type="integer"> 131072 </attrib>
            <attrib name="routetab" type="integer"> 1048576 </attrib>
        </section>
        <attrib name="check_meta_robots" type="boolean"> yes </attrib>
        <attrib name="cookie_timeout" type="integer"> 900 </attrib>
        <section name="crawlmode">
            <attrib name="fwdlinks" type="boolean"> yes </attrib>
            <attrib name="fwdredirects" type="boolean"> yes </attrib>
            <attrib name="mode" type="string"> FULL </attrib>
            <attrib name="reset_level" type="boolean"> no </attrib>
        </section>
        <attrib name="csum_cut_off" type="integer"> 0 </attrib>
        <attrib name="cut_off" type="integer"> 5000000 </attrib>
        <attrib name="dbswitch" type="integer"> 5 </attrib>
        <attrib name="dbswitch_delete" type="boolean"> no </attrib>
        <attrib name="delay" type="real"> 60.0 </attrib>
        <attrib name="domain_clustering" type="boolean"> no </attrib>
        <attrib name="enforce_delay_per_ip" type="boolean"> yes </attrib>
        <attrib name="exclude_exts" type="list-string">
            <member> .jpg </member>
            <member> .jpeg </member>
            <member> .ico </member>
            <member> .tif </member>
            <member> .png </member>
            <member> .bmp </member>
            <member> .gif </member>
            <member> .wmf </member>
            <member> .avi </member>
            <member> .mpg </member>
            <member> .wmv </member>
            <member> .wma </member>
            <member> .ram </member>
            <member> .asx </member>
            <member> .asf </member>
            <member> .mp3 </member>
            <member> .wav </member>
            <member> .ogg </member>
            <member> .ra </member>
            <member> .aac </member>
            <member> .m4a </member>
            <member> .zip </member>
            <member> .gz </member>
            <member> .vmarc </member>
            <member> .z </member>
            <member> .tar </member>
            <member> .iso </member>
            <member> .img </member>
            <member> .rpm </member>
            <member> .cab </member>
            <member> .rar </member>
            <member> .ace </member>
            <member> .hqx </member>
            <member> .swf </member>
            <member> .exe </member>
            <member> .java </member>
            <member> .jar </member>
            <member> .prz </member>
            <member> .wrl </member>
            <member> .midr </member>
            <member> .css </member>
            <member> .ps </member>
            <member> .ttf </member>
            <member> .mso </member>
            <member> .dvi </member>
        </attrib>
        <attrib name="extract_links_from_dupes" type="boolean"> no </attrib>
        <attrib name="fetch_timeout" type="integer"> 300 </attrib>
        <attrib name="force_mimetype_detection" type="boolean"> no </attrib>
        <section name="ftp_errors">
            <attrib name="4xx" type="string"> DELETE:3 </attrib>
            <attrib name="550" type="string"> DELETE:0 </attrib>
            <attrib name="5xx" type="string"> DELETE:3 </attrib>
            <attrib name="int" type="string"> KEEP:0 </attrib>
            <attrib name="net" type="string"> DELETE:3, RETRY:1 </attrib>
            <attrib name="ttl" type="string"> DELETE:3 </attrib>
        </section>
        <attrib name="headers" type="list-string">
            <member> User-Agent: FAST Enterprise Crawler 6 </member>
        </attrib>
        <attrib name="html_redir_is_redir" type="boolean"> yes </attrib>
        <attrib name="html_redir_thresh" type="integer"> 3 </attrib>
        <section name="http_errors">
            <attrib name="4xx" type="string"> DELETE:0 </attrib>
            <attrib name="5xx" type="string"> DELETE:10 </attrib>
            <attrib name="int" type="string"> KEEP:0 </attrib>
            <attrib name="net" type="string"> DELETE:3, RETRY:1 </attrib>
            <attrib name="ttl" type="string"> DELETE:3 </attrib>
        </section>
        <attrib name="if_modified_since" type="boolean"> yes </attrib>
        <attrib name="javascript_keep_html" type="boolean"> no </attrib>
        <section name="limits">
            <attrib name="disk_free" type="integer"> 0 </attrib>
            <attrib name="disk_free_slack" type="integer"> 3 </attrib>
            <attrib name="max_doc" type="integer"> 0 </attrib>
            <attrib name="max_doc_slack" type="integer"> 1000 </attrib>
        </section>
        <section name="link_extraction">
            <attrib name="a" type="boolean"> yes </attrib>
            <attrib name="action" type="boolean"> yes </attrib>
            <attrib name="area" type="boolean"> yes </attrib>
            <attrib name="card" type="boolean"> yes </attrib>
            <attrib name="comment" type="boolean"> no </attrib>
            <attrib name="embed" type="boolean"> no </attrib>
            <attrib name="frame" type="boolean"> yes </attrib>
            <attrib name="go" type="boolean"> yes </attrib>
            <attrib name="img" type="boolean"> no </attrib>
            <attrib name="layer" type="boolean"> yes </attrib>
            <attrib name="link" type="boolean"> yes </attrib>
            <attrib name="meta" type="boolean"> yes </attrib>
            <attrib name="meta_refresh" type="boolean"> yes </attrib>
        </section>
        <section name="log">
            <attrib name="dsfeed" type="string"> text </attrib>
            <attrib name="fetch" type="string"> text </attrib>
            <attrib name="postprocess" type="string"> text </attrib>
            <attrib name="site" type="string"> text </attrib>
        </section>
        <attrib name="login_failed_ignore" type="boolean"> no </attrib>
        <attrib name="login_timeout" type="integer"> 300 </attrib>
        <attrib name="max_backoff_counter" type="integer"> 50 </attrib>
        <attrib name="max_backoff_delay" type="integer"> 600 </attrib>
        <attrib name="max_doc" type="integer"> 1000000 </attrib>
        <attrib name="max_pending" type="integer"> 2 </attrib>
        <attrib name="max_redirects" type="integer"> 10 </attrib>
        <attrib name="max_reflinks" type="integer"> 0 </attrib>
        <attrib name="max_sites" type="integer"> 128 </attrib>
        <attrib name="max_uri_recursion" type="integer"> 5 </attrib>
        <attrib name="mufilter" type="integer"> 0 </attrib>
        <attrib name="near_duplicate_detection" type="boolean"> no </attrib>
        <attrib name="obey_robots_delay" type="boolean"> no </attrib>
        <section name="pp">
            <attrib name="ds_max_ecl" type="integer"> 10 </attrib>
            <attrib name="ds_meta_info" type="list-string">
                <member> duplicates </member>
                <member> redirects </member>
                <member> mirrors </member>
                <member> metadata </member>
            </attrib>
            <attrib name="ds_paused" type="boolean"> no </attrib>
            <attrib name="ds_send_links" type="boolean"> no </attrib>
            <attrib name="max_dupes" type="integer"> 10 </attrib>
            <attrib name="stripe" type="integer"> 1 </attrib>
        </section>
        <section name="ppdup">
            <attrib name="compact" type="boolean"> yes </attrib>
        </section>
        <attrib name="proxy_max_pending" type="integer"> 2147483647 </attrib>
        <attrib name="refresh" type="real"> 1440.0 </attrib>
        <attrib name="refresh_mode" type="string"> scratch </attrib>
        <attrib name="refresh_when_idle" type="boolean"> no </attrib>
        <attrib name="robots" type="boolean"> yes </attrib>
        <attrib name="robots_auth_ignore" type="boolean"> yes </attrib>
        <attrib name="robots_timeout" type="integer"> 300 </attrib>
        <attrib name="robots_tout_ignore" type="boolean"> no </attrib>
        <attrib name="robots_ttl" type="integer"> 86400 </attrib>
        <section name="rss">
            <attrib name="auto_discover" type="boolean"> no </attrib>
            <attrib name="del_expired_links" type="boolean"> no </attrib>
            <attrib name="follow_links" type="boolean"> no </attrib>
            <attrib name="ignore_rules" type="boolean"> no </attrib>
            <attrib name="index_feed" type="boolean"> no </attrib>
            <attrib name="max_link_age" type="integer"> 0 </attrib>
            <attrib name="max_link_count" type="integer"> 128 </attrib>
        </section>
        <attrib name="smfilter" type="integer"> 0 </attrib>
        <attrib name="sort_query_params" type="boolean"> no </attrib>
        <attrib name="start_uris" type="list-string">
            <member> http://www.contoso.com </member>
        </attrib>
        <section name="storage">
            <attrib name="clusters" type="integer"> 8 </attrib>
            <attrib name="compress" type="boolean"> yes </attrib>
            <attrib name="compress_exclude_mime" type="list-string">
                <member> application/x-shockwave-flash </member>
            </attrib>
            <attrib name="datastore" type="string"> bstore </attrib>
            <attrib name="defrag_threshold" type="integer"> 85 </attrib>
            <attrib name="remove_docs" type="boolean"> no </attrib>
            <attrib name="store_dupes" type="boolean"> no </attrib>
            <attrib name="store_http_header" type="boolean"> yes </attrib>
        </section>
        <attrib name="truncate" type="boolean"> no </attrib>
        <attrib name="umlogs" type="boolean"> yes </attrib>
        <attrib name="uri_search_mime" type="list-string">
            <member> text/html </member>
            <member> text/vnd.wap.wml </member>
            <member> text/wml </member>
            <member> text/x-wap.wml </member>
            <member> x-application/wml </member>
            <member> text/x-hdml </member>
        </attrib>
        <attrib name="use_cookies" type="boolean"> no </attrib>
        <attrib name="use_http_1_1" type="boolean"> yes </attrib>
        <attrib name="use_javascript" type="boolean"> no </attrib>
        <attrib name="use_meta_csum" type="boolean"> no </attrib>
        <attrib name="use_sitemaps" type="boolean"> no </attrib>
        <section name="workqueue_priority">
            <attrib name="default" type="integer"> 1 </attrib>
            <attrib name="levels" type="integer"> 1 </attrib>
            <attrib name="pop_scheme" type="string"> default </attrib>
            <attrib name="start_uri_pri" type="integer"> 1 </attrib>
        </section>
    </DomainSpecification>
</CrawlerConfig>
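A configuration of this size can be easier to review when flattened into "section/attrib = value" lines. The following sketch is a hypothetical helper (the function name and the abbreviated snippet are illustrations, not part of the product) that walks the nested section and attrib elements shown above, joining member values for list-string attribs.

```python
import xml.etree.ElementTree as ET

# An abbreviated fragment of the configuration above, for illustration.
SNIPPET = """<CrawlerConfig>
    <DomainSpecification name="default_example">
        <attrib name="delay" type="real"> 60.0 </attrib>
        <section name="crawlmode">
            <attrib name="mode" type="string"> FULL </attrib>
        </section>
    </DomainSpecification>
</CrawlerConfig>"""

def flatten(node, prefix=""):
    """Yield (path, value) pairs for every attrib under a node."""
    for child in node:
        if child.tag == "attrib":
            members = child.findall("member")
            if members:
                # list-string attribs: join the member values.
                value = ", ".join(m.text.strip() for m in members)
            else:
                value = (child.text or "").strip()
            yield prefix + child.get("name"), value
        elif child.tag == "section":
            # Recurse into the section, extending the path.
            yield from flatten(child, prefix + child.get("name") + "/")

domain = ET.fromstring(SNIPPET).find("DomainSpecification")
for path, value in flatten(domain):
    print(f"{path} = {value}")
# → delay = 60.0
# → crawlmode/mode = FULL
```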
