Use crawl rules to specify a different content access account or authentication method (Search Server 2008)

Applies To: Microsoft Search Server 2008

 

Topic Last Modified: 2009-04-06

Note

Unless otherwise noted, the information in this article applies to both Microsoft Search Server 2008 and Microsoft Search Server 2008 Express.

In Microsoft Search Server 2008, you can create new crawl rules or edit existing crawl rules to specify a different content access account or authentication method to use when crawling a particular path. You can also specify the order in which crawl rules are applied.

In this article:

  • Crawling sites that use forms-based authentication

  • Create a crawl rule

  • Edit a crawl rule

  • Delete a crawl rule

  • Reorder crawl rules

Crawling sites that use forms-based authentication

Search Server 2008 supports crawling sites that use forms-based authentication (FBA) when FBA is implemented by using an input Submit type. Search Server 2008 does not support crawling content on sites whose logon pages contain a series of forms that span multiple pages (wizard-based forms), or forms that use dynamic content rendered by using AJAX, JavaScript, or other dynamic scripting methods. Sites that use the following types of logon forms are not supported:

  • Wizard-style logon pages   Search Server 2008 does not crawl sites that use a series of screens to authenticate users. These wizard-style forms present one or more pages based on the user's input in a form on a previous page. Because Search Server 2008 cannot crawl multiple logon pages, the creation of a crawl rule for a site that uses this type of authentication is not supported.

  • Logon forms that change dynamically   Search Server 2008 does not crawl sites whose logon pages change dynamically by using technologies such as AJAX. A logon screen that uses AJAX can present new options to a user without a visible postback. In other words, scripting displays new data without refreshing the page in the browser. For example, a user might type a password and then be presented with a new form that asks a security question, without seeing the page refresh. The creation of a crawl rule for a site that uses this type of design is not supported.
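
The restrictions above can be turned into a rough pre-check of a logon page before you create a crawl rule. The following Python sketch is an illustration only, not the logic that Search Server 2008 uses. It assumes that a supported page contains exactly one form with at least one input of type submit and no script elements. The class name LogonFormScan and the function looks_like_simple_submit_form are hypothetical. A static scan of one page cannot detect a wizard-style logon that spans several pages, so a True result is only a hint.

```python
from html.parser import HTMLParser


class LogonFormScan(HTMLParser):
    """Counts forms, submit inputs, and script elements in one HTML page."""

    def __init__(self):
        super().__init__()
        self.forms = 0
        self.submit_inputs = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = dict(attrs)
        if tag == "form":
            self.forms += 1
        elif tag == "input" and (attr.get("type") or "").lower() == "submit":
            self.submit_inputs += 1
        elif tag == "script":
            self.scripts += 1


def looks_like_simple_submit_form(html: str) -> bool:
    """Heuristic (assumption): one form, a submit input, and no scripting."""
    scan = LogonFormScan()
    scan.feed(html)
    return scan.forms == 1 and scan.submit_inputs >= 1 and scan.scripts == 0


static_page = (
    '<form action="/login" method="post">'
    '<input type="text" name="user">'
    '<input type="password" name="pw">'
    '<input type="submit" value="Sign in">'
    "</form>"
)
scripted_page = '<script src="login.js"></script>' + static_page

print(looks_like_simple_submit_form(static_page))    # True
print(looks_like_simple_submit_form(scripted_page))  # False
```

A page that fails this check is not necessarily unsupported, and a page that passes might still change dynamically through externally loaded content. Treat the result as a prompt to inspect the logon page manually.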

Note   Before you perform the procedures in this article, confirm that you have read the topic Configure how the crawler authenticates (Search Server 2008).

Important

You must be a search services administrator to perform the following procedures. For more information, see Add or remove a search services administrator (Search Server 2008).

Create a crawl rule

Use the following procedure to create a crawl rule that specifies a different content access account or authentication method to use when crawling a particular path.

To create a crawl rule

  1. On the Search Administration page, in the Crawling section, click Crawl rules.

  2. On the Manage Crawl Rules page, click New Crawl Rule.

  3. On the Add Crawl Rule page, in the Path section, in the Path box, type the path affected by this rule. You can use standard wildcard characters in the path. For example, you can type:

    • http://server1/folder* to include all Web resources with a URL that starts with http://server1/folder.

    • *://*.txt to include every document with the .txt file name extension.

  4. In the Crawl Configuration section, to prevent a folder or subsite in the path from being crawled, click Exclude all items in this path.

  5. To include items in the path in the crawl, click Include all items in this path, and then optionally select any combination of the following check boxes:

    • Follow links on the URL without crawling the URL itself   Select this check box if you want the links on the logon page to be crawled, but you do not want the text on the logon page to be indexed.

    • Crawl complex URLs (URLs that contain a question mark (?))   Select this check box if you want to crawl URLs that use parameters to display additional content.

    • Crawl SharePoint content as Http pages   Normally, content on SharePoint sites is crawled by using a special protocol. Select this check box if you want content on SharePoint sites to be crawled as HTTP pages instead.

    Note

    When the content is crawled by using the HTTP protocol, the item permissions are not stored.

  6. In the Specify Authentication section, do one of the following:

    Note

    To select any of the options in this section, you must first click Include all items in this path in the Crawl Configuration section.

    • To use the default content access account when crawling URLs affected by this crawl rule, click Use the default content access account.

    • If you want to use a different content access account, click Specify a different content access account, and then do the following:

      In the Account box, type the name of an account that can access the paths defined by this crawl rule, for example user_name or DOMAIN\user_name.

      In the Password and Confirm Password boxes, type the password for the account.

      To prevent Basic authentication from being used, select the Do not allow Basic Authentication check box. To allow Basic authentication, clear the check box.

      Note

      You cannot use Basic authentication if the content access account that is used to crawl the content affected by this crawl rule belongs to a different domain than your server farm.

    • To use a client certificate for authentication, click Specify client certificate, and then on the Certificate menu, click a certificate.

    • To use forms-based authentication, click Specify forms credentials, type the form location in the Form URL box, and then click Enter Credentials.

      Note

      Search Server 2008 supports crawling sites that use forms-based authentication (FBA), when FBA is implemented by using an input Submit type. Search Server 2008 does not support crawling content on sites whose logon pages contain a series of forms that span multiple pages (wizard-based forms) or forms that use dynamic content rendered by using AJAX, JavaScript, or other dynamic scripting methods.

    • To use cookie authentication, click Use cookie for crawling, and then do one of the following:

      • To obtain a cookie from a URL, type the full location in the Obtain cookie from a URL box, and then click Get Cookie.

      • To select a specific cookie from your computer or your network, click Specify cookie for crawling, click Browse, and then select the cookie to be used.

      • To specify the error pages that are displayed when a cookie has expired, in the Error Pages (semi-colon delimited) box, type the URLs of the pages, separated by semicolons.

  7. Click OK.
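
The wildcard paths described in step 3 can be illustrated with a short Python sketch. It assumes that the asterisk (*) is the only wildcard character, that it matches any run of characters, and that matching is not case-sensitive. The article does not confirm these details, and the function name path_matches is hypothetical. The sketch is only a way to reason about which URLs a path is likely to cover, not a description of how Search Server 2008 evaluates paths.

```python
import re


def path_matches(pattern: str, url: str) -> bool:
    """Return True if url matches pattern, treating '*' as 'any characters'.

    Assumptions (not from the product documentation): '*' is the only
    wildcard, and comparison ignores case.
    """
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url, flags=re.IGNORECASE) is not None


# The two example paths from step 3.
print(path_matches("http://server1/folder*", "http://server1/folder/page.aspx"))  # True
print(path_matches("*://*.txt", "file://share/docs/readme.txt"))                  # True
print(path_matches("*://*.txt", "http://server1/readme.doc"))                     # False
```

Because http://server1/folder* has no separator after folder, under these assumptions it also covers URLs such as http://server1/folder2/default.aspx. Add a trailing slash before the asterisk if you want to limit the rule to a single folder.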

Edit a crawl rule

You can edit an existing crawl rule at any time by going to the Manage Crawl Rules page, clicking the crawl rule, and then making the necessary changes to the path and configuration, as described in the previous procedure.

Delete a crawl rule

Use the following procedure to delete a crawl rule that you no longer need.

To delete a crawl rule

  1. On the Search Administration page, in the Crawling section, click Crawl rules.

  2. On the Manage Crawl Rules page, point to the crawl rule that you want to delete, click the arrow that appears, and then click Delete.

  3. Click OK to confirm the deletion.

Reorder crawl rules

After you create new crawl rules or edit existing ones, we recommend that you specify the order in which you want the rules to be applied when content is crawled. Crawl rules are applied in the order in which they are listed. Therefore, if two rules cover the same or overlapping content, the first rule that is listed is applied. Use the following procedure to specify the order of your crawl rules.

To reorder crawl rules

  1. On the Search Administration page, in the Crawling section, click Crawl rules.

  2. On the Manage Crawl Rules page, in the Order column in the list of crawl rules, select a value in the list that specifies the position you want the rule to occupy. Other values are shifted accordingly.
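
The effect of rule order can be sketched in Python. The example models the statement that the first listed rule that covers a URL is the one applied. The CrawlRule class, its fields, the account labels, and the matching helper are all hypothetical, and the wildcard handling assumes that * matches any run of characters without regard to case.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CrawlRule:
    path: str      # wildcard path, as typed in the Path box
    include: bool  # True = include items, False = exclude items
    account: str   # illustrative label for the content access account


def _matches(pattern: str, url: str) -> bool:
    # Assumption: '*' is the only wildcard and matching ignores case.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url, flags=re.IGNORECASE) is not None


def first_matching_rule(rules: List[CrawlRule], url: str) -> Optional[CrawlRule]:
    """Return the first rule in list order whose path covers the URL."""
    for rule in rules:
        if _matches(rule.path, url):
            return rule
    return None  # no crawl rule applies to this URL


rules = [
    CrawlRule("http://server1/folder/private*", include=False, account="default"),
    CrawlRule("http://server1/folder*", include=True, account="DOMAIN\\user_name"),
]
url = "http://server1/folder/private/report.htm"

# Specific exclusion listed first: the URL is excluded.
print(first_matching_rule(rules, url).include)                  # False
# Same rules in the opposite order: the broader rule wins and the URL is included.
print(first_matching_rule(list(reversed(rules)), url).include)  # True
```

In this model, placing narrower rules above broader ones keeps the narrower rule from being shadowed, which is why the order matters when two rules overlap.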