Use crawl rules to determine what content gets crawled (Search Server 2008)

Applies To: Microsoft Search Server 2008

 

Topic Last Modified: 2009-08-10

Note

Unless otherwise noted, the information in this article applies to both Microsoft Search Server 2008 and Microsoft Search Server 2008 Express.

In this article:

  • Create a crawl rule

  • Edit a crawl rule

  • Delete a crawl rule

  • Reorder crawl rules

Before you perform these procedures, confirm that you meet the following requirement.

Important

You must be a search services administrator to perform the procedures in this article. For more information, see Add or remove a search services administrator (Search Server 2008).

You can create new crawl rules or edit existing crawl rules to determine what content gets crawled. You can also reorder crawl rules to specify the order in which these rules are applied.

Create a crawl rule

Use the following procedure to create a crawl rule.

Create a crawl rule

  1. On the Search Administration page, in the Crawling section, click Crawl rules.

  2. On the Manage Crawl Rules page, click New Crawl Rule.

  3. On the Add Crawl Rule page, in the Path section, in the Path box, type the path affected by this rule. You can use standard wildcard characters in the path. For example:

    • http://server1/folder* matches all Web resources with a URL that starts with http://server1/folder.

    • *://*.txt matches every document with the txt file extension.

  4. In the Crawl Configuration section, select one of the following:

    • Exclude all items in this path. Select this option if you want all items in the specified path to be excluded from the crawl.

    • Include all items in this path. Select this option if you want all items in the path to be crawled.

  5. If you chose to exclude all items in this path, skip to step 6. Otherwise, you can further refine the inclusion by selecting any combination of the following:

    • Follow links on the URL without crawling the URL itself. Select this option if you want to crawl links contained within the URL, but not the URL itself.

    • Crawl complex URLs (URLs that contain a question mark (?)). Select this option if you want to crawl URLs that contain parameters that use the question mark (?) notation.

    • Crawl SharePoint content as HTTP pages. Normally, SharePoint content is crawled by using a special protocol. Select this option if you want SharePoint content to be crawled as HTTP pages instead. When the content is crawled by using the HTTP protocol, item permissions are not stored. This means that all items that match a particular search query appear on search results pages, regardless of whether the user that initiated the query has access to those items.

      This setting is intended for search administrators who crawl remote SharePoint sites that they do not control, and who therefore cannot ensure that the domain account used to crawl those sites has been granted full-read permissions on them.

    Note

    For information about the settings in the Specify Authentication section, see Use crawl rules to specify a different content access account or authentication method (Search Server 2008).

  6. Click OK.

  7. Repeat steps 2 through 6 for each new crawl rule you want to create.
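The wildcard paths in step 3 and the complex-URL option in step 5 can be reasoned about with a small sketch. The following Python is an illustration only, not how Search Server evaluates paths. It assumes that * matches any run of characters, that every other character is literal, that matching ignores case, and that the whole URL must match the path. The function names path_matches and is_complex_url are hypothetical.

```python
import re


def path_matches(pattern: str, url: str) -> bool:
    """Return True if url matches a crawl-rule-style wildcard path.

    Assumptions (not confirmed by the product documentation):
    '*' stands for any run of characters, including none; all other
    characters are literal; matching is case-insensitive.
    """
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, url, flags=re.IGNORECASE) is not None


def is_complex_url(url: str) -> bool:
    """Return True if the URL contains a question mark (?)."""
    return "?" in url


if __name__ == "__main__":
    # Example URLs are made up for illustration.
    print(path_matches("http://server1/folder*", "http://server1/folder/report.docx"))
    print(path_matches("*://*.txt", "http://server1/docs/notes.txt"))
    print(path_matches("*://*.txt", "http://server1/page.aspx"))
    print(is_complex_url("http://server1/list.aspx?id=7"))
```

Testing candidate URLs against a path in this way before you create a rule can help confirm that the path is neither broader nor narrower than you intended.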

Edit a crawl rule

You can edit an existing crawl rule at any time. On the Manage Crawl Rules page, click the rule, and then change the path and configuration as described in the previous procedure.

Note

Editing a crawl rule requires a full crawl of the content that is affected by the changed rule.

Delete a crawl rule

Use the following procedure to delete a crawl rule that is no longer needed.

Delete a crawl rule

  1. On the Search Administration page, in the Crawling section, click Crawl rules.

  2. On the Manage Crawl Rules page, point to the crawl rule that you want to delete, click the arrow that appears, and then click Delete on the menu that appears.

  3. Click OK to confirm the deletion.

Note

Deleting a crawl rule requires a full crawl of the content that was affected by the deleted rule.

Reorder crawl rules

After you create new crawl rules, we recommend that you specify the order in which you want the rules to be applied while content is being crawled. Crawl rules are applied in the order in which they are listed. Therefore, if two rules cover the same or overlapping content, the first rule that is listed is applied. Use the following procedure to specify the order of your crawl rules.

Reorder crawl rules

  1. On the Search Administration page, in the Crawling section, click Crawl rules.

  2. On the Manage Crawl Rules page, in the Order column in the list of crawl rules, select a value in the list that specifies the position you want the rule to occupy. Other values are shifted accordingly.

    You can also use a global exclusion rule, which applies regardless of the order in which it is listed. For more information about administering crawl rules, see the Administrating Crawl Rules section in the following resource: Book Excerpt - Chapter 16 Enterprise search and indexing architecture and administration.

Note

Reordering crawl rules requires a full crawl of the content that is affected by the repositioned crawl rule.
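The effect of rule order on overlapping paths can be illustrated with a short sketch. The following Python models only the behavior described in this article, that the first listed rule covering a URL is the one applied. It is not the product's implementation. The CrawlRule class, the first_matching_rule function, and the example URLs are hypothetical, and the wildcard handling assumes that * matches any run of characters, ignoring case. Global exclusion rules are not modeled.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CrawlRule:
    path: str      # wildcard path, for example "http://server1/folder*"
    include: bool  # True = include all items, False = exclude all items


def _matches(pattern: str, url: str) -> bool:
    # Assumption: '*' matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, url, flags=re.IGNORECASE) is not None


def first_matching_rule(rules: List[CrawlRule], url: str) -> Optional[CrawlRule]:
    """Return the first rule in list order whose path covers url, or None."""
    for rule in rules:
        if _matches(rule.path, url):
            return rule
    return None


if __name__ == "__main__":
    exclude_private = CrawlRule("http://server1/folder/private*", include=False)
    include_folder = CrawlRule("http://server1/folder*", include=True)
    url = "http://server1/folder/private/plan.docx"

    # Narrow exclusion listed first: the URL is excluded.
    print(first_matching_rule([exclude_private, include_folder], url))
    # Broad inclusion listed first: the same URL is included instead.
    print(first_matching_rule([include_folder, exclude_private], url))
```

In this sketch, listing the narrower exclusion rule above the broader inclusion rule is what keeps the private folder out of the crawl. With the order reversed, the broader rule is reached first and the exclusion never takes effect.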