Plan for crawling and federation (Search Server 2010)

 

Applies to: Search Server 2010

Topic Last Modified: 2013-02-03

Before end-users can use the enterprise search functionality in Microsoft Search Server 2010, you must crawl or federate the content that you want to make available for users to search. Planning to crawl or federate includes the following tasks:

  • Plan content sources

  • Plan file-type inclusions and IFilters

  • Plan for authentication

  • Plan connectors

  • Plan to manage the impact of crawling

  • Plan crawl rules

  • Plan search settings that are managed at the farm level

  • Plan for federation

Plan content sources

A content source is a set of options that you can use to specify what type of content is crawled, what URLs to crawl, and how deep and when to crawl. The default content source is Local SharePoint sites. You can use this content source to specify how to crawl all content in all Web applications that are associated with a particular Search service application. By default, for each Web application that uses a particular Search service application, Search Server 2010 adds the start address of the top-level site of each site collection to the default content source.

Some organizations can use the default content source to satisfy their search requirements. However, many organizations need additional content sources. Plan additional content sources when you have to do the following:

  • Crawl different types of content — for example, SharePoint sites, file shares, and business data.

  • Crawl some content on different schedules than other content.

  • Limit or increase the quantity of content that is crawled.

  • Set different priorities for crawling different sites.

You can create up to 500 content sources in each Search service application, and each content source can contain as many as 500 start addresses. To keep administration as simple as possible, we recommend that you limit the number of content sources that you create.

Plan to crawl different kinds of content

You can crawl only one kind of content per content source. That is, you can create a content source that contains start addresses for SharePoint sites and another content source that contains start addresses for file shares. However, you cannot create a single content source that contains start addresses for both SharePoint sites and file shares. The following table lists the kinds of content sources that you can configure.

Use this kind of content source | For this content

SharePoint sites | SharePoint sites from the same farm or different Microsoft SharePoint Server 2010, Microsoft SharePoint Foundation 2010, or Microsoft Search Server 2010 farms; SharePoint sites from the same farm or different Microsoft Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Microsoft Search Server 2008 farms; SharePoint sites from Microsoft Office SharePoint Portal Server 2003 or Windows SharePoint Services 2.0 farms

Web sites | Other Web content in your organization that is not located in SharePoint sites; content on Web sites on the Internet

File shares | Content on file shares in your organization

Exchange public folders | Microsoft Exchange Server content

Lotus Notes | E-mail messages stored in Lotus Notes databases

Business data | Business data that is stored in line-of-business applications

Note

Unlike crawling SharePoint sites on SharePoint Server 2010, SharePoint Foundation 2010, or Search Server 2010, the crawler cannot automatically crawl all subsites in a site collection from previous versions of SharePoint Products and Technologies. Therefore, when crawling SharePoint sites from previous versions, you must specify the start address of each top-level site and the URL of each subsite that you want to crawl.

Note

Unlike all other kinds of content sources, the Lotus Notes content source option does not appear in the user interface until you have installed and configured the appropriate prerequisite software. For more information, see Configure and use the Lotus Notes connector (Search Server 2010).

Plan content sources for business data

Business data content sources require that the applications hosting the data are specified in an Application Model in a Business Data Connectivity service application. You can create one content source to crawl all applications that are registered in the Business Data Connectivity service, or you can create separate content sources to crawl individual applications.

Often, the people who plan for integration of business data into site collections are not the same people involved in the overall content planning process. Therefore, include business application administrators in content planning teams so that they can advise you how to integrate the business application data into content and effectively present it in the site collections.

Crawl content on different schedules

You must decide whether some content is crawled more frequently than other content. The larger the volume of content that you crawl, the more likely it is that you are crawling content from different content repositories. The content might not be of the same type and might be located on servers of varying capacities. These factors make it more likely that you have to add content sources to crawl the different content repositories on different schedules.

Primary reasons that content is crawled on different schedules are as follows:

  • To accommodate down times and periods of peak usage.

  • To more frequently crawl content that is more frequently updated.

  • To crawl content that is located on slower servers separately from content that is located on faster servers.

In many cases, not all of this information can be known until after Search Server 2010 is deployed and has run for some time. In these cases, you must specify crawl schedules after the farm is in production. Nonetheless, it is a good idea to consider these factors during planning so that you can plan crawl schedules based on the information that you have.

The following two sections provide more information about crawling content on different schedules.

Considerations for planning crawl schedules

You can configure crawl schedules independently for each content source. For each content source, you can specify a time to do full crawls and a separate time to do incremental crawls. Note that you must run a full crawl for a particular content source before you can run an incremental crawl. If you specify an incremental crawl for content that has not yet been crawled, the system performs a full crawl instead.

Note

Because a full crawl crawls all content that the crawler encounters and has at least read access to, regardless of whether that content was previously crawled, full crawls can take significantly more time to complete than incremental crawls.

We recommend that you plan crawl schedules based on the availability, performance, and bandwidth considerations of the crawl and query servers.

When you plan crawl schedules, consider the following best practices:

  • Group start addresses in content sources based on similar availability and with acceptable overall resource usage for the servers that host the content.

  • Schedule incremental crawls for each content source during times when the servers that host the content are available and when there is low demand on the resources of the server.

  • Stagger crawl schedules so that the load on the servers in the farm is distributed over time.

  • Schedule full crawls only when you have to for the reasons listed in the next section. We recommend that you run full crawls less frequently than incremental crawls.

  • Schedule administration changes that require a full crawl to occur shortly before the planned schedule for full crawls. For example, we recommend that you create a new crawl rule shortly before the next scheduled full crawl so that an additional full crawl is not necessary.

  • Base concurrent crawls on the capacity available. For best performance, we recommend that you stagger the crawling schedules of content sources. You can optimize crawl schedules over time as you become familiar with the typical crawl durations for each content source.
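
As a rough illustration of staggering, the following Python sketch assigns each content source a start time so that crawls run one after another inside a nightly window instead of all at once. The content source names, estimated durations, window start, and gap are hypothetical planning inputs, not values that Search Server 2010 provides.

```python
from datetime import datetime, timedelta

def stagger_crawls(sources, window_start, gap_minutes=15):
    """Return (name, start_time) pairs so that no two crawls overlap.

    sources: list of (name, estimated_duration_minutes) tuples (hypothetical).
    window_start: datetime at which the first crawl may begin.
    gap_minutes: idle time left between consecutive crawls.
    """
    schedule = []
    next_start = window_start
    for name, duration in sources:
        schedule.append((name, next_start))
        # The next crawl starts after this one is expected to finish, plus a gap.
        next_start += timedelta(minutes=duration + gap_minutes)
    return schedule

plan = stagger_crawls(
    [("Local SharePoint sites", 60), ("HR file shares", 30), ("Intranet Web sites", 45)],
    datetime(2013, 2, 4, 22, 0),
)
for name, start in plan:
    print(f"{start:%H:%M}  {name}")
```

After the farm has been in production for some time, the estimated durations can be replaced with the typical crawl durations that you observe for each content source.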

Reasons to do a full crawl

Reasons for a Search service application administrator to do a full crawl include the following:

  • A software update or service pack was installed on servers in the farm. See the instructions for the software update or service pack for more information.

  • A Microsoft Office SharePoint Server 2007 shared services administrator or Search Server 2010 Search service application administrator added a new managed property. A full crawl is required for the new managed property to take effect immediately. If you do not want the new managed property to take effect immediately, a full crawl is not required.

  • You want to re-index ASPX pages on Windows SharePoint Services 3.0 or Microsoft Office SharePoint Server 2007 sites.

    Note

    The crawler cannot discover when ASPX pages on Windows SharePoint Services 3.0 or Office SharePoint Server 2007 sites have changed. Because of this, incremental crawls do not re-index views or home pages when individual list items are deleted. We recommend that you periodically do full crawls of sites that contain ASPX files to ensure that these pages are re-indexed.

  • You want to resolve consecutive incremental crawl failures. If an incremental crawl fails one hundred consecutive times at any level in a repository, the system removes the affected content from the index.

  • Crawl rules have been added, deleted, or modified.

  • You want to repair a corrupted index.

  • The Search service application administrator has created one or more server name mappings.

  • The credentials for the user account that is assigned to the default content access account or a crawl rule have changed.

The system does a full crawl even when an incremental crawl is requested under the following circumstances:

  • A search administrator stopped the previous crawl.

  • A content database was restored, or a farm administrator has detached and reattached a content database.

    Note

    If you are running Office SharePoint Server 2007 with the Infrastructure Update for Microsoft Office Servers or Search Server 2010, you can use the restore operation of the Stsadm command-line tool to change whether a content database restore causes a full crawl.

  • A full crawl of the site has never been done from this Search service application.

  • The change log does not contain entries for the addresses that are being crawled. Without entries in the change log for the items being crawled, incremental crawls cannot occur.
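
The circumstances in the preceding list can be summarized as a small decision function. The following Python sketch is only a planning aid that restates the list. The class and field names are hypothetical, and it does not call any Search Server 2010 API.

```python
from dataclasses import dataclass

@dataclass
class CrawlState:
    """Hypothetical snapshot of the crawl history for a set of start addresses."""
    full_crawl_completed: bool    # a full crawl was done from this Search service application
    previous_crawl_stopped: bool  # a search administrator stopped the previous crawl
    content_db_restored: bool     # a content database was restored, or detached and reattached
    change_log_has_entries: bool  # the change log has entries for the addresses being crawled

def effective_crawl_type(requested: str, state: CrawlState) -> str:
    """Return 'full' or 'incremental' for a requested crawl type."""
    if requested == "full":
        return "full"
    forces_full = (
        not state.full_crawl_completed
        or state.previous_crawl_stopped
        or state.content_db_restored
        or not state.change_log_has_entries
    )
    return "full" if forces_full else "incremental"

healthy = CrawlState(True, False, False, True)
restored = CrawlState(True, False, True, True)
print(effective_crawl_type("incremental", healthy))
print(effective_crawl_type("incremental", restored))
```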

You can adjust schedules after the initial deployment based on the performance and capacity of servers in the farm and the servers hosting content.

Limit or increase the quantity of content that is crawled

For each content source, you can specify how extensively to crawl the start addresses. You also specify the behavior of the crawl by changing the crawl settings. The options that are available for a particular content source vary based on the content source type that you select. However, most crawl options specify how many levels deep in the hierarchy from each start address to crawl. Note that this behavior is applied to all start addresses in a particular content source. If you have to crawl some sites at deeper levels, you can create additional content sources that include those sites.

You can use crawl setting options to limit or increase the quantity of content that is crawled. The options available in the properties for each content source vary depending on the content source type that is selected. The following table describes best practices when you configure crawl setting options.

For this kind of content source | If this pertains | Use this crawl setting option

SharePoint sites | You want to include the content that is on the site itself and you do not want to include the content that is on subsites, or you want to crawl the content that is on subsites on a different schedule. | Crawl only the SharePoint site of each start address

SharePoint sites | You want to include the content on the site itself, or you want to crawl all content under the start address on the same schedule. | Crawl everything under the host name of each start address

Web sites | Content available on linked sites is unlikely to be relevant. | Crawl only within the server of each start address

Web sites | Relevant content is located on only the first page. | Crawl only the first page of each start address

Web sites | You want to limit how deep to crawl the links on the start addresses. | Custom: Specify the number of pages deep and number of server hops to crawl

File shares, Exchange public folders | Content available in the subfolders is unlikely to be relevant. | Crawl only the folder of each start address

File shares, Exchange public folders | Content in the subfolders is likely to be relevant. | Crawl the folder and subfolders of each start address

Business data | All applications that are registered in the BDC metadata store contain relevant content. | Crawl the whole BDC metadata store

Business data | Not all applications that are registered in the BDC metadata store contain relevant content, or you want to crawl some applications on a different schedule. | Crawl selected applications

Note

When you use the Custom option for a highly connected Web site, we recommend that you start with a small number, because specifying more than three pages deep or more than three server hops can result in crawling the entire Internet.
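
To see why small page-depth and server-hop values matter for Web sites, consider the following Python sketch, which walks a toy link graph under both limits. The URLs and the link graph are hypothetical. The counting rule is also an assumption made for illustration (every followed link adds one to the page depth, and a link to a different host also adds one to the server hops), so it may not match the crawler's exact behavior.

```python
from collections import deque
from urllib.parse import urlparse

def crawl_scope(start, links, max_depth, max_hops):
    """Breadth-first walk of a link graph under page-depth and server-hop limits.

    links: dict mapping a URL to the URLs it links to (a toy, hypothetical graph).
    Assumption: every followed link adds 1 to depth; a link to a different
    host also adds 1 to hops.
    """
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, depth, hops)
    while queue:
        url, depth, hops = queue.popleft()
        for target in links.get(url, []):
            new_depth = depth + 1
            new_hops = hops + (urlparse(target).netloc != urlparse(url).netloc)
            if target in seen or new_depth > max_depth or new_hops > max_hops:
                continue
            seen.add(target)
            queue.append((target, new_depth, new_hops))
    return seen

toy_links = {
    "http://a.example/": ["http://a.example/p1", "http://b.example/"],
    "http://a.example/p1": ["http://a.example/p2"],
    "http://b.example/": ["http://c.example/"],
}
narrow = crawl_scope("http://a.example/", toy_links, max_depth=1, max_hops=0)
wider = crawl_scope("http://a.example/", toy_links, max_depth=2, max_hops=1)
print(sorted(narrow))
print(sorted(wider))
```

Even in this tiny graph, raising each limit by one doubles the number of pages in scope. On a highly connected site the growth is much steeper.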

Other considerations when planning content sources

You cannot crawl the same start addresses by using multiple content sources in the same Search service application. For example, if you use a particular content source to crawl a site collection and all its subsites, you cannot use a different content source to crawl one of those subsites separately on a different schedule.

In addition to considering crawl schedules, your decision about whether to group start addresses in a single content source or create additional content sources depends largely upon administration considerations. Administrators often make changes that update a particular content source. Changing a content source requires a full crawl of the content repository that is specified in that content source. To make administration easier, organize content sources in such a way that updating content sources, crawl rules, and crawl schedules is convenient for administrators.

Plan file-type inclusions and IFilters

Content is crawled only if the relevant file name extension is included in the file-type inclusions list and an IFilter that supports that file type is installed on the crawl server. Several file types and IFilters are included automatically during initial installation. When you plan for content sources in your initial deployment, determine whether content that you want to crawl uses file types that are not included. If file types are not included, you must add those file types on the Manage File Types page during deployment and ensure that an IFilter is installed and registered to support each file type.

If you want to exclude certain file types from being crawled, you can delete the file name extension for that file type from the file type inclusions list. Doing so excludes file names that have that extension from being crawled. For a list of file types and IFilters that are installed by default, see File types and IFilters reference (Search Server 2010).
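
The two conditions described in this section can be checked during planning with a simple inventory comparison. The following Python sketch assumes that you have collected, by hand, the set of extensions on the file-type inclusions list and the set of extensions for which an IFilter is registered. Both sets and the file names are hypothetical examples.

```python
import os

def is_crawlable(file_name, included_extensions, ifilter_extensions):
    """Return True only if the extension is on the inclusions list AND has an IFilter."""
    ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    return ext in included_extensions and ext in ifilter_extensions

# Hypothetical planning inventory, not the product defaults.
included = {"docx", "pdf", "txt"}
ifilters = {"docx", "txt"}

for name in ["notes.docx", "report.PDF", "archive.zip"]:
    print(name, is_crawlable(name, included, ifilters))
```

In this example, report.PDF is reported as not crawlable because no IFilter is registered for the pdf extension even though the extension is on the inclusions list.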

Plan for authentication

When the crawler accesses the start addresses that are listed in content sources, the crawler must be authenticated by, and granted access to, the servers that host that content. This means that the domain account that is used by the crawler must have at least read permissions on the content.

By default, the system uses the default content access account. Alternatively, you can use crawl rules to specify a different content access account to use when crawling particular content. Whether you use the default content access account or a different content access account specified by a crawl rule, the content access account that you use must have read permissions on all content that is crawled. If the content access account does not have read permissions, the content is not crawled, is not indexed, and therefore is not available to queries.

We recommend that the account that you specify as the default content access account has access to most of your crawled content. Only use other content access accounts when security considerations require separate content access accounts.

For each content source that you plan, determine the start addresses that cannot be accessed by the default content access account, and then plan to add crawl rules for those start addresses.
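
This planning step amounts to a lookup from URL to content access account. The following Python sketch is one way to record and test such a plan. It does not reproduce how Search Server 2010 evaluates crawl rules internally. The account names, the URL patterns, and the first-match-wins ordering are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical account name used only for this example.
DEFAULT_CONTENT_ACCESS_ACCOUNT = "CONTOSO\\svc-crawl"

def content_access_account(url, crawl_rules):
    """Return the account planned for a URL.

    crawl_rules: ordered list of (url_pattern, account) pairs, where * matches
    any run of characters. Assumption: the first matching rule wins.
    """
    for pattern, account in crawl_rules:
        if fnmatchcase(url.lower(), pattern.lower()):
            return account
    return DEFAULT_CONTENT_ACCESS_ACCOUNT

planned_rules = [("http://hr.contoso.com/*", "CONTOSO\\svc-crawl-hr")]
print(content_access_account("http://hr.contoso.com/benefits", planned_rules))
print(content_access_account("http://intranet.contoso.com/", planned_rules))
```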

Important

Ensure that the domain account that is used for the default content access account or any other content access account is not the same domain account that is used by an application pool associated with any Web application that you crawl. Doing so can cause unpublished content in SharePoint sites and minor versions of files (that is, history) in SharePoint sites to be crawled and indexed.

Another important consideration is that the crawler must use the same authentication protocol as the host server. By default, the crawler authenticates by using NTLM. You can configure the crawler to use a different authentication protocol, if it is necessary.

If you are using claims-based authentication, ensure that Windows authentication is enabled on any Web applications to be crawled.

Plan connectors

All content that is crawled requires that you use a connector (known as a protocol handler in previous versions) to gain access to that content. Search Server 2010 provides connectors for all common Internet protocols. However, if you want to crawl content that requires a connector that is not installed with Search Server 2010, you must install the third-party or custom connector before you can crawl that content. For a list of connectors that are installed by default, see Default connectors (Search Server 2010). For information about how to install connectors, see Install connectors (Search Server 2010).

Plan to manage the impact of crawling

Crawling content can significantly decrease the performance of the servers that host the content. The impact that this has on a particular server varies depending on the load that the host server is experiencing and whether the server has sufficient resources (especially CPU and RAM) to maintain service-level agreements under ordinary or peak usage.

Search administrators can use crawler impact rules to manage the impact the crawler has on the servers that are being crawled. For each crawler impact rule, you can specify a single URL or use wildcard characters in the URL path to include a block of URLs to which the rule applies. You can then specify how many concurrent requests for pages are made to the specified URL, or you can request only one document at a time and wait a number of seconds that you specify between requests.

Crawler impact rules specify the rate at which the crawler requests content from a particular start address or range of start addresses (also known as a site name). A crawler impact rule applies to all content sources in the Search service application and request frequencies apply per crawl component. The following table shows the wildcard characters that you can use in the site name when you are adding or editing a crawler impact rule.

This wildcard character | Has this result

* as the site name | Applies the rule to all sites.

*.* as the site name | Applies the rule to sites that have dots in the name.

*.site_name.com as the site name | Applies the rule to all sites in the site_name.com domain (for example, *.adventure-works.com).

*.top-level_domain_name as the site name | Applies the rule to all sites that end with a specific top-level domain name, for example, *.com or *.net.

? | Replaces a single character in a rule. For example, *.adventure-works?.com applies to all sites in the domains adventure-works1.com, adventure-works2.com, and so on.

You can create a crawler impact rule that applies to all sites in a particular top-level domain. For example, *.com applies to all Internet sites that have addresses that end in .com. Suppose that an administrator of a portal site adds a content source for samples.microsoft.com. The rule for *.com applies to this site unless you add a crawler impact rule specifically for samples.microsoft.com.
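
To check which planned rule would cover a given site, you can approximate the wildcard behavior with Python's fnmatch module, in which * matches any run of characters and ? matches a single character. This is a sketch for planning only. The rule values are hypothetical, and the order in which competing wildcard rules are tried is an assumption; only the preference for a rule that names the site exactly comes from the description above.

```python
from fnmatch import fnmatchcase

def matching_rule(site_name, rules):
    """Return the site-name pattern of the rule that covers a site, or None.

    rules: dict mapping a lowercase site-name pattern to a rule setting.
    A rule that names the site exactly wins over wildcard rules; among
    wildcard rules, the first match in dict order is used (an assumption).
    """
    site = site_name.lower()
    if site in rules:
        return site
    for pattern in rules:
        if fnmatchcase(site, pattern):
            return pattern
    return None

# Hypothetical rules: pattern -> number of simultaneous requests.
general = {"*.com": 8}
specific = {"*.com": 8, "samples.microsoft.com": 2}

print(matching_rule("samples.microsoft.com", general))
print(matching_rule("samples.microsoft.com", specific))
print(fnmatchcase("www.adventure-works1.com", "*.adventure-works?.com"))
```

Note that fnmatch also gives square brackets a special meaning, which the table above does not mention, so avoid brackets in test patterns.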

You can coordinate with the administrators of search systems that are crawling content in your organization to set crawler impact rules based on the performance and capacity of the servers. For most external sites, this coordination is not possible. Requesting too much content on external servers or making requests too frequently can cause administrators of those sites to limit access if crawls are using too many resources. During initial deployment, set the crawler impact rules to make as small an impact on other servers as possible while still crawling enough content frequently enough to ensure that the freshness of the index meets your service-level agreement. After the farm is in production, you can adjust crawler impact rules based on data from crawl logs.

Plan crawl rules

Crawl rules apply to all content sources in the search service application. You can apply crawl rules to a particular URL or set of URLs to do the following things:

  • Avoid crawling irrelevant content by excluding one or more URLs. This also helps reduce the use of server resources and network traffic, and increases the relevance of search results.

  • Crawl links on the URL without crawling the URL itself. This option is useful for sites that have links to relevant content when the page that contains the links does not contain relevant information.

  • Enable complex URLs to be crawled. This option directs the system to crawl URLs that contain a query parameter specified with a question mark. Depending upon the site, these URLs might not include relevant content. Because complex URLs can often redirect to irrelevant sites, it is a good idea to enable this option only on sites where you know that the content available from complex URLs is relevant.

  • Enable content on SharePoint sites to be crawled as HTTP pages. This option enables the system to crawl SharePoint sites that are behind a firewall or in scenarios in which the site being crawled restricts access to the Web service that is used by the crawler.

  • Specify whether to use the default content access account, a different content access account, or a client certificate for crawling the specified URL.

Because crawling content consumes resources and bandwidth, it is better to include a smaller amount of content that you know is relevant than a larger amount of content that might be irrelevant. After the initial deployment, you can review the query and crawl logs and adjust content sources and crawl rules to be more relevant and include more content.

Plan search settings that are managed at the farm level

Several settings that are managed at the farm level affect how content is crawled. Consider the following farm-level search settings while planning for crawling:

  • Contact e-mail address: Crawling content affects the resources of the servers that are being crawled. Before you can crawl content, you must provide in the configuration settings the e-mail address of the person in your organization whom administrators can contact if the crawl adversely affects their servers. This e-mail address appears in logs for administrators of the servers being crawled so that those administrators can contact someone if the impact of crawling on performance and bandwidth is too great, or if other issues occur.

    The contact e-mail address should belong to a person who has the necessary expertise and availability to respond quickly to requests. Alternatively, you can use a closely monitored distribution-list alias as the contact e-mail address. Regardless of whether the content that is being crawled is stored internally to the organization or not, quick response is important.

  • Proxy server settings: You can choose whether to use a proxy server when crawling content. The proxy server to use depends on the topology of your Search Server 2010 deployment and the architecture of other servers in your organization. You will likely have to use a proxy server when crawling Internet content. For more information about how to configure proxy server settings for search, see Configure farm-level proxy server settings (Search Server 2010) and Configure proxy server settings for search (Search Server 2010).

  • Time-out settings: Time-out settings are used to limit the time that the search system waits while connecting to other services.

  • SSL setting: The Secure Sockets Layer (SSL) setting determines whether the SSL certificate must exactly match to crawl content.

Plan for federation

Federated search is the concurrent querying of multiple Web resources or databases to generate a single search results page for end-users. When you add a federated location, end-users can search for and retrieve content that has not been crawled by servers in the local system. Federated locations enable queries to be sent to remote search engines and feeds. Accordingly, the system renders the results to end-users as if the federated content were part of the crawled content.

Search Server 2010 supports the following types of federated locations:

  • Search index on this server. You can use any local or remote site in your organization that has a server that is running Search Server 2010 as a federated location. For example, imagine that a SharePoint site on a Human Resources server in your company is the only available source of employee contact information. Even if the site is not part of your crawl scope, you can configure a federated location for it so that users who initiate a search from your Search Center site can retrieve employee contact information results that they are authorized to see. The following conditions apply:

    1. The location is set to Search Index on this Server.

    2. No query template is required. Search Server 2010 uses the object model to query a location.

    3. Default server authentication is used.

    4. Advanced search queries are not supported.

  • OpenSearch 1.0 or 1.1. You can use any public Web site that supports the OpenSearch standard as a federated location. An example of such a location is an Internet search engine such as Bing, or a search results page that supports RSS or Atom protocols. For example, imagine that you want users who search your internal sites for proprietary technical research to also see related research information from public Web sites. If you configure a federated location for a Bing search query, Web search results are automatically included for users. The following conditions apply:

    1. Queries can be sent to a search engine as a URL, such as http://www.example.com/search.aspx?q=TEST.

    2. Search results are returned in RSS, Atom, or another structured XML format.

    3. Location capabilities, query templates, and response elements are part of an OpenSearch description (.osdx) file that is associated with the location.

    4. Extensions to OpenSearch that are specific to Search Server 2010 support the ability to include triggers and the ability to associate XSL code with search results.

    5. The choice of metadata to display in the search results is determined by the OpenSearch location.

    For more information about OpenSearch, visit http://www.opensearch.org.

When a search query is sent to a federated location, it is sent as URL parameters in a format called a query template. The system then formats and renders the results as XML for users of the Search Center site. The XML is displayed in a Web Part on the search results page as readable text. You can add and configure Web Parts on the search results page as a Federated Search Results Web Part, Top Federated Results Web Part, or Core Results Web Part. By default, the search results page contains three Federated Search Results Web Parts.
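
The following Python sketch shows how a query template might be expanded into a request URL. The {searchTerms} placeholder is the standard OpenSearch template parameter. The example.com address is the placeholder URL used earlier in this article, and the helper function is hypothetical rather than part of Search Server 2010.

```python
from urllib.parse import quote

def build_query_url(template, search_terms):
    """Substitute URL-encoded search terms for the {searchTerms} placeholder."""
    return template.replace("{searchTerms}", quote(search_terms, safe=""))

template = "http://www.example.com/search.aspx?q={searchTerms}"
url = build_query_url(template, "crawl rules")
print(url)  # http://www.example.com/search.aspx?q=crawl%20rules
```

Encoding the terms before substitution keeps spaces and reserved characters such as & from changing the meaning of the URL.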

Consider the following questions when you are determining whether you want to display federated search results to users:

  1. Do you want to display custom results for particular searches? To help ensure that the federated location returns results that match specific queries, you can use trigger rules. When you create a trigger rule for a federated location, the Web Part that is associated with that location displays results only for user queries that match the pattern or prefix that you specify.

  2. Can you use a URL to specify which results to retrieve for a query? To create a federated location, you must specify a query template, which is the combination of the URL and the parameters that are required to send a search query and return the results as XML. When you add this information to the Query template field on the Add Federated Location page, you must format the string correctly (as shown in the example on the Add Federated Location page) or the search results provider will not return any results.

  3. Can users access the links that are provided by the federated location? If your organization grants only limited access to Internet resources, using an Internet search engine as a federated location might frustrate users because they will not be able to view some search results.

  4. Is authentication required? If the federated location requires authentication, you must provide the correct credentials. Many federated locations, such as Internet search engines, do not require credentials.
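
To make the idea of trigger rules from the first question more concrete, the following Python sketch models a prefix trigger and a pattern trigger. The matching semantics shown here are assumptions for illustration (a prefix trigger fires when the first word of the query equals the prefix and passes on the remaining words; a pattern trigger fires when a regular expression matches the whole query). Verify the actual behavior on the Add Federated Location page before you rely on it.

```python
import re

def prefix_trigger(prefix, query):
    """Return the remaining terms if the query starts with the prefix, else None."""
    words = query.split()
    if words and words[0].lower() == prefix.lower():
        return " ".join(words[1:])
    return None

def pattern_trigger(pattern, query):
    """Return True if the regular expression matches the whole query."""
    return re.fullmatch(pattern, query, flags=re.IGNORECASE) is not None

print(prefix_trigger("weather", "Weather Seattle"))
print(prefix_trigger("weather", "news Seattle"))
print(pattern_trigger(r"\d{5}", "98052"))
```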

Plan authentication types for federation

Several kinds of user authentication, including per-user and common credentials, are available for federated search. Note, however, that with per-user authentication, collecting credentials for any authentication protocol other than Kerberos requires a Web Part extension. In the authentication and credentials information section of the location definition, you specify the authentication type for the federated location. The authentication type can be one of the following:

  • Anonymous

    No credentials are required to connect to the federated location.

  • Common

    Each connection uses the same set of credentials to connect to the federated location.

  • Per-user

    The credentials of the user who submitted the search query are used to connect to the federated location.

For the common and per-user authentication types, you must also specify one of the following authentication protocols:

  • Basic

    Basic authentication is part of the HTTP specification and is supported by most browsers.

    Security Note

    Web browsers that use Basic authentication transmit passwords that are not encrypted. By monitoring communications on the network, a malicious user can use publicly available tools to intercept and decode these passwords. Therefore, we do not recommend Basic authentication unless you are confident that the connection is secure, such as with a dedicated line or a Secure Sockets Layer (SSL) connection.

  • Digest

    Digest authentication relies on the HTTP 1.1 protocol as defined in RFC 2617. Because Digest authentication requires HTTP 1.1 compliance, some browsers do not support it. If a browser that is not compliant with HTTP 1.1 requests a file when Digest authentication is enabled, the request is rejected because Digest authentication is not supported by the client. Digest authentication can be used only in Windows domains. Digest authentication works with Windows Server 2008, Windows Server 2003, and Microsoft Windows 2000 Server domain accounts only, and may require the accounts to store passwords by using reversible encryption.

  • NTLM

    User records are stored in the security accounts manager (SAM) database or in the Active Directory database. Each user account is associated with two passwords: the LAN Manager-compatible password and the Windows password. Each password is encrypted and stored in the SAM database or in the Active Directory database.

  • Kerberos (per-user authentication type only)

    By using the Kerberos protocol, a party at either end of a network connection can verify that the party on the other end is the entity it claims to be. Although NTLM enables servers to verify the identities of their clients, NTLM does not enable clients to verify a server’s identity, nor does NTLM enable one server to verify the identity of another. NTLM authentication is designed for a network environment in which servers are assumed to be trusted.

  • Forms-based

    A forms-based authentication cookie is simply the container for an authentication ticket. Each request passes the ticket as the value of the cookie, and the server uses the ticket to identify an authenticated user. Cookieless forms-based authentication instead passes the ticket in the URL in an encrypted format. It is used because client browsers might block cookies. This feature was introduced in the Microsoft .NET Framework 2.0.
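
The Security Note for Basic authentication earlier in this list is easy to demonstrate. In the following Python sketch, the user name and password are hypothetical. It builds the Authorization header value that Basic authentication sends and then recovers the credentials from it by using only Base64 decoding, which is an encoding and not encryption.

```python
import base64

def basic_auth_header(user, password):
    """Build the value of an HTTP Authorization header for Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

def decode_basic_auth(header):
    """Recover (user, password) from a Basic Authorization header value."""
    scheme, _, token = header.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic authentication header")
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password

header = basic_auth_header("svc-federation", "Pa55word")  # hypothetical credentials
print(header)
print(decode_basic_auth(header))
```

Anyone who can observe the request can run the same decoding step, which is why the connection itself must be protected, for example with SSL.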

If you are using claims-based authentication in your environment, ensure that Windows authentication is also enabled on any Web applications to be crawled. For more information about authentication methods in SharePoint Server 2010, see Plan authentication methods (SharePoint Server 2010).