
Plan crawling and federation in SharePoint Server 2013

SharePoint 2013

Updated: May 14, 2013

Summary: Plan to crawl or federate results from different kinds of content and plan to apply the appropriate settings.

Applies to:  SharePoint Server 2013 

Before end-users can use search functionality in SharePoint 2013, you must crawl or federate the content that you want to make available for users to search. Effective search depends on well-planned content sources, connectors, file types, crawl rules, authentication, and federation.


Introduction

You can crawl content to build a search index that queries (search requests) run against. You can also run queries against an external provider, such as Bing, which contains publicly available content on the Internet, and display those results alongside the results from your own index (for example, an index of local intranet content). Getting search results from both a search index of crawled content and an external provider is called federation. Whether you crawl, federate, or do both depends on various factors.

About crawling and content processing

The SharePoint 2013 crawler can crawl various hosts and databases that contain content or data by using different protocols and APIs, such as web protocols. Content sources that you define in the Search service application tell the crawler which hosts to crawl, how to crawl them, and when. The crawler connects the search system with the content repositories.

In the search architecture, the crawler (represented by the crawl component in the search topology) crawls specified content sources and then feeds the crawled items (their contents and their metadata) to the content processing component. The content processing component then processes these crawled items (URLs, documents, and so on) by reading, parsing, and further processing the crawled properties within files or within the content itself. The content processing component reports the properties of the item (new crawled properties) to the Search Administration database.

You can map crawled properties to managed properties and configure property settings by editing the search schema. The content processing component reads this search schema and uses it to carry out the mapping. Only managed properties are included in the search index. Managed properties can be used to create refiners, for example. For more information, see Overview of the search schema in SharePoint Server 2013.

Plan content sources

A content source is a SharePoint search-specific term that represents a group of crawl settings. When you set up a content source, you configure which hosts you want to crawl. You also indicate the type of content that will be crawled, for example HTTP, Lotus Notes, or SharePoint content, and configure other crawl parameters such as the crawl schedule and how deep to crawl.

The default content source is Local SharePoint sites. You can use this content source to specify how to crawl all content in a web application that is associated with a particular Search service application.

If you have only one type of host (for example all content on the hosts is of the type SharePoint sites or file shares), you should consider defining only one content source. However, if you have multiple, different types of content or unique requirements per host, you should consider defining multiple content sources.

Plan additional content sources when you have to do the following:

  • Crawl different types of content — for example, SharePoint sites, file shares, and business data.

  • Crawl some content on different schedules than other content.

  • Limit or increase the quantity of content that is crawled.

  • Set different priorities for crawling different sites.

  • Keep some types of content fresher than others.

Although you can create an unlimited number of content sources in each Search service application, each content source can contain up to 100 start addresses. To keep administration as easy as possible, we recommend that you limit the number of content sources that you create.

Plan to crawl different kinds of content

You can only crawl one kind of content per content source. That is, you can create a content source that contains start addresses for SharePoint sites and another content source that contains start addresses for file shares. However, you cannot create a single content source that contains start addresses to both SharePoint sites and file shares, for example. The following table lists the kinds of content sources that you can configure.

Use this kind of content source | For this content
SharePoint sites | SharePoint sites from the same farm or different SharePoint Server 2013 or SharePoint Foundation 2013 farms. SharePoint sites from the same farm or different SharePoint Server 2010, SharePoint Foundation 2010, or Microsoft Search Server 2010 farms. SharePoint sites from the same farm or different Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Search Server 2008 farms.
Web sites | Other web content in your organization that is not located in SharePoint sites. Content on web sites on the Internet.
File shares | Content on file shares in your organization.
Exchange public folders | Exchange Server 2013 content.
Lotus Notes | E-mail messages stored in Lotus Notes databases. Note: Unlike all other kinds of content sources, the Lotus Notes content source option does not appear in the user interface until you have installed and configured the appropriate prerequisite software. For more information, see Configure and use the Lotus Notes connector for SharePoint Server 2013.
Documentum | Content from the EMC Documentum system. Note: You can't crawl EMC Documentum content before you have installed and configured the appropriate prerequisite software and the Microsoft SharePoint 2013 Indexing Connector for Documentum. For more information, see Configure and use the Documentum connector in SharePoint Server 2013.
Line of business data | Business data that is stored in line-of-business applications.
Custom repository | Content sources that can only be crawled after a custom connector is installed and registered.
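The two constraints described in this section (one kind of content per content source, and up to 100 start addresses per content source) can be checked against a draft plan before anything is configured. The following Python sketch is only a planning aid and does not call any SharePoint API. The function name, the kind labels, and the shape of the input dictionary are assumptions made for illustration.

```python
# Planning aid only: this does not talk to SharePoint.
# The limit below is the one stated in this article.
MAX_START_ADDRESSES = 100


def validate_content_source(name, kind_by_address):
    """Return a list of problems found in one planned content source.

    kind_by_address is a hypothetical mapping of each start address to a
    label for its kind of content, for example "sharepoint" or "file_share".
    """
    problems = []
    kinds = sorted(set(kind_by_address.values()))
    if len(kinds) > 1:
        problems.append(
            f"{name}: mixes kinds of content {kinds}; "
            "plan one content source per kind"
        )
    if len(kind_by_address) > MAX_START_ADDRESSES:
        problems.append(
            f"{name}: {len(kind_by_address)} start addresses exceeds "
            f"the limit of {MAX_START_ADDRESSES}"
        )
    return problems


if __name__ == "__main__":
    draft = {
        "http://intranet": "sharepoint",
        "file://fileserver/share": "file_share",
    }
    for problem in validate_content_source("Mixed source", draft):
        print(problem)
```

An empty list means that the planned content source satisfies both constraints.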

Content sources for line of business data

Business data content sources require that the applications hosting the data are specified in an Application Model in a Business Data Connectivity service application. You can create one content source to crawl all applications that are registered in the Business Data Connectivity service, or you can create separate content sources to crawl individual applications. For more information, see Search connector framework in SharePoint 2013 (MSDN).

Often, the people who plan for integration of business data into site collections are not the same people involved in the overall content planning process. Therefore, include business application administrators in content planning teams so that they can advise you how to integrate the business application data into content and effectively present it in the site collections.

Crawl content on different schedules

Consider defining content sources with varying schedules based on the following factors:

  • To accommodate down times and periods of peak usage.

  • To more frequently crawl content that is more frequently updated.

  • To crawl content that is located on slower servers separately from content that is located on faster servers.

  • To continuously crawl a SharePoint content source because of high freshness demands.

Reasons to do a full crawl

Reasons for a Search service application administrator to do a full crawl include the following:

  • A software update or service pack was installed on servers in the farm. See the instructions for the software update or service pack for more information.

  • A Search service application administrator, site collection administrator or tenant administrator added a new managed property or changed an existing managed property. A full crawl is required for the new or changed managed property to take effect.

  • You want to detect security changes that were made to local groups on a file share after the last full crawl of the file share.

  • You want to resolve consecutive incremental crawl failures. If an incremental crawl fails a large number of consecutive times at any level in a repository, the system removes the affected content from the search index.

  • Crawl rules have been added, deleted, or modified.

  • You want to repair a corrupted search index.

  • The credentials for the user account that is assigned to the default content access account have changed. A full crawl is required only if the permissions of this user account have changed.

The system does a full crawl even when an incremental crawl or continuous crawl is scheduled under the following circumstances:

  • A search administrator stopped the previous crawl.

  • A content database was restored, or a farm administrator has detached and reattached a content database.

  • A full crawl of the content source has never been done from this Search service application.

  • The crawl database does not contain entries for the addresses that are being crawled. Without entries in the crawl database for the items being crawled, incremental crawls cannot occur.
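The circumstances in the preceding list can be summarized as a small decision function. This Python sketch only models the rules as they are described in this article. It is not how the search system is implemented, and the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CrawlState:
    """Hypothetical snapshot of the conditions listed in this article."""
    previous_crawl_stopped: bool = False
    content_db_restored_or_reattached: bool = False
    full_crawl_ever_done: bool = True
    crawl_db_has_entries: bool = True


def effective_crawl_type(scheduled, state):
    """Return the crawl type that actually runs.

    scheduled is "full", "incremental", or "continuous".
    """
    if scheduled == "full":
        return "full"
    forces_full = (
        state.previous_crawl_stopped
        or state.content_db_restored_or_reattached
        or not state.full_crawl_ever_done
        or not state.crawl_db_has_entries
    )
    return "full" if forces_full else scheduled


if __name__ == "__main__":
    # A restored content database turns a scheduled incremental crawl
    # into a full crawl.
    state = CrawlState(content_db_restored_or_reattached=True)
    print(effective_crawl_type("incremental", state))
```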

Limit or increase the quantity of content that is crawled

The options available in the properties for each content source vary depending on the content source type that you select. You can use crawl setting options to limit or increase the quantity of content that is crawled. For each content source, you can specify how extensively to crawl the start addresses. Most content source types allow you to specify how many levels deep in the hierarchy from each start address to crawl. Note that this behavior is applied to all start addresses in a particular content source. If you have to crawl some sites at deeper levels, you can create additional content sources that include those sites. The following table describes best practices when you configure crawl setting options.

For this kind of content source | If this pertains | Use this crawl setting option
SharePoint sites | You want to include the content that is on the site itself and you do not want to include the content that is on subsites, or you want to crawl the content that is on subsites on a different schedule. | Crawl only the SharePoint site of each start address.
SharePoint sites | You want to include the content on the site itself, or you want to crawl all content under the start address on the same schedule. | Crawl everything under the host name of each start address.
Web sites | Content available on linked sites is unlikely to be relevant. | Crawl only within the server of each start address.
Web sites | Relevant content is located on only the first page. | Crawl only the first page of each start address.
Web sites | You want to limit how deep to crawl the links on the start addresses. | Custom: specify the number of pages deep and number of server hops to crawl. Note: For a highly connected site, we recommend that you start with a small number, because specifying more than three pages deep or more than three server hops can result in crawling the whole Internet.
File shares, Exchange public folders | Content available in the subfolders is unlikely to be relevant. | Crawl only the folder of each start address.
File shares, Exchange public folders | Content in the subfolders is likely to be relevant. | Crawl the folder and subfolders of each start address.
Business data | All applications that are registered in the Business Data Catalog metadata store contain relevant content. | Crawl the whole Business Data Catalog metadata store.
Business data | Not all applications that are registered in the BDC metadata store contain relevant content, or you want to crawl some applications on a different schedule. | Crawl selected applications.

Plan connectors

All content that is crawled requires you to use a connector (known as a protocol handler in earlier versions) to acquire and index that content. SharePoint 2013 provides connectors for the most popular protocols. However, if you want to crawl content that requires a connector that is not provided by default, you must install a third-party connector or build a custom connector before you can crawl that content. For a list of connectors that are installed by default, see Default connectors in SharePoint Server 2013.

Other considerations when planning content sources

In addition to considering crawl schedules, your decision about whether to group start addresses in a single content source or create additional content sources depends largely upon administration considerations. To make administration easier, organize content sources in such a way that updating content sources, crawl rules, and crawl schedules is convenient for administrators.

Warning:

Changing a content source requires a full crawl.

  • You can’t crawl the same start addresses by using multiple content sources in the same Search service application. For example, if you use a particular content source to crawl a site collection and all its subsites, you cannot use a different content source to crawl one of those subsites separately on a different schedule.

  • Administrators often make changes that update a particular content source. Since changing a content source requires a full crawl of the content repository that is specified in that content source, consider creating separate content sources to make sure that a full crawl does not take a very long time.
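A draft plan can be checked for start addresses that would be covered by more than one content source. The following Python sketch is a planning aid under simplifying assumptions: it treats a start address as covering every URL beneath it (as with the "crawl everything under" options) and compares addresses as normalized strings. The function names and the input format are hypothetical.

```python
from itertools import combinations


def _normalize(address):
    # Simplification: compare addresses as lowercase strings
    # without a trailing slash.
    return address.strip().lower().rstrip("/")


def _covers(parent, child):
    # "http://host/a" covers "http://host/a/b" but not "http://host/ab".
    return child == parent or child.startswith(parent + "/")


def find_overlaps(plan):
    """Return (source_a, address_a, source_b, address_b) tuples for every
    pair of start addresses that overlap across content sources.

    plan maps a content source name to its list of start addresses.
    """
    conflicts = []
    for (name_a, addrs_a), (name_b, addrs_b) in combinations(plan.items(), 2):
        for a in addrs_a:
            for b in addrs_b:
                na, nb = _normalize(a), _normalize(b)
                if _covers(na, nb) or _covers(nb, na):
                    conflicts.append((name_a, a, name_b, b))
    return conflicts


if __name__ == "__main__":
    plan = {
        "HR sites": ["http://intranet/sites/hr"],
        "HR benefits (hourly)": ["http://intranet/sites/hr/benefits"],
        "File shares": ["file://fileserver/share"],
    }
    for conflict in find_overlaps(plan):
        print(conflict)
```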

Plan crawl rules to optimize crawls

Crawl rules apply to all content sources in the Search service application. You can apply crawl rules to a particular URL or set of URLs to do the following things:

  • Avoid crawling irrelevant content by excluding one or more URLs. This also helps reduce the use of server resources and network traffic.

  • Crawl links on the URL without crawling the URL itself. This option is useful for sites that have links to relevant content when the page that contains the links does not contain relevant information.

  • Enable complex URLs to be crawled. This option directs the system to crawl URLs that contain a query parameter specified with a question mark. Depending upon the site, these URLs might not include relevant content. Because complex URLs can often redirect to irrelevant sites, it is a good idea to enable this option only on sites where you know that the content available from complex URLs is relevant.

  • Enable content on SharePoint sites to be crawled as HTTP pages. This option enables the system to crawl SharePoint sites that are behind a firewall or in scenarios in which the site being crawled restricts access to the Web service that is used by the crawler.

  • Specify whether to use the default content access account, a different content access account, or a client certificate for crawling the specified URL.

Because crawling content consumes resources and bandwidth, it is better to include a smaller amount of content that you know is relevant than a larger amount of content that might be irrelevant. After the initial deployment, you can review the query and crawl logs and adjust content sources and crawl rules to be more relevant and include more content.
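When you draft crawl rules, it can help to try them against sample URLs before you configure them. The following Python sketch models include and exclude rules only. It assumes that rules are evaluated in order and that the first matching rule decides, and it uses `*` as a wildcard by way of Python's `fnmatch` module, which also treats `?` and `[...]` as wildcard syntax. These are assumptions of the sketch and not a specification of how SharePoint matches crawl rule paths.

```python
from fnmatch import fnmatchcase


def should_crawl(url, rules, default=True):
    """Decide whether a URL would be crawled under a draft rule list.

    rules is an ordered list of (pattern, action) pairs, where action is
    "include" or "exclude". Assumption: the first matching rule wins.
    URLs that match no rule fall back to `default`.
    """
    candidate = url.lower()
    for pattern, action in rules:
        if fnmatchcase(candidate, pattern.lower()):
            return action == "include"
    return default


if __name__ == "__main__":
    draft_rules = [
        ("http://intranet/archive/*", "exclude"),
        ("http://intranet/*", "include"),
    ]
    for url in (
        "http://intranet/archive/2009/report.docx",
        "http://intranet/hr/policy.docx",
    ):
        print(url, should_crawl(url, draft_rules))
```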

Plan crawler authentication

When the crawler accesses the start addresses that are listed in content sources, the crawler must be authenticated by, and granted access to, the servers that host that content. By default, the system uses the default content access account. Or, you can use crawl rules to specify a different content access account to use when crawling particular content. Whether you use the default content access account or a different content access account specified by a crawl rule, the content access account that you use must have at least read permissions on all content that is crawled. If the content access account does not have read permissions, the content is not crawled, is not indexed, and therefore is not available to queries.

We recommend that the account that you specify as the default content access account has access to most of your crawled content. Only use other content access accounts when security considerations require separate content access accounts.

For each content source that you plan, determine the start addresses that cannot be accessed by the default content access account, and then plan to add crawl rules for those start addresses.

Important:

Ensure that the domain account that is used for the default content access account or any other content access account is not the same domain account that is used by an application pool associated with any Web application that you crawl. Doing so can cause unpublished content in SharePoint sites and minor versions of files (that is, history) in SharePoint sites to be crawled and indexed.

Another important consideration is that the crawler must use the same authentication protocol as the host server. By default, the crawler authenticates by using NTLM. You can configure the crawler to use a different authentication protocol, if it is necessary.

If you are using claims-based authentication, make sure that Windows authentication is enabled on any Web applications to be crawled.

Plan content processing

Include or exclude file types

Any content from any file type that you want to include in the search index has to be crawled by the crawl component and parsed by the content processing component. If the content processing component cannot parse the file type, the search index only includes file properties, for example the title. The crawl component can only crawl a file if the list on the Manage File Types page includes the file name extension. The content processing component can only parse the contents of a crawled file when:

  • SharePoint 2013 has a format handler that can parse the file format.

  • SharePoint 2013 is configured to use the format handler to parse files with the file name extension.

SharePoint 2013 satisfies these requirements for many file types by default, and you can crawl and parse these file types without installing additional format handlers. See Default crawled file name extensions and parsed file types in SharePoint Server 2013 for an overview of these file types.

Note:

The content processing component uses a format handler to read files with a specific format. You can extend the initial collection of file formats that SharePoint 2013 can parse by adding third-party filter-based format handlers, known as iFilters. You cannot override a built-in format handler with a third-party iFilter.

When you plan to include content in the search index from content sources with file types that are not on the Manage File Types page, review the following:

  • To crawl the file type, you must add the file type on the Manage File Types page.

  • To parse the file type when SharePoint 2013 has a format handler for it, you must enable parsing of files with the file format and file name extension for each content processing component in the Search service application.

  • To parse the file type when SharePoint 2013 does not have a format handler for it, you must:

    • Install a third-party iFilter for that file format and file name extension on each server that hosts a content processing component.

    • Enable parsing of the file type and file name extension for each content processing component in the Search service application.

For more information, see Add or remove a file type from the search index in SharePoint Server 2013.
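The crawl and parse requirements in this section determine how much of a file reaches the search index. The following Python sketch expresses that decision for planning purposes. The two sets stand in for the Manage File Types list and for the extensions that have parsing enabled. They are hypothetical inputs and are not read from a farm.

```python
def index_coverage(extension, crawled_extensions, parse_enabled_extensions):
    """Describe what the search index would contain for a file type.

    crawled_extensions: extensions on the Manage File Types page.
    parse_enabled_extensions: extensions for which a format handler is
    available and parsing is enabled.
    """
    ext = extension.lower().lstrip(".")
    if ext not in crawled_extensions:
        return "not crawled"
    if ext not in parse_enabled_extensions:
        return "file properties only"
    return "file properties and content"


if __name__ == "__main__":
    crawled = {"docx", "pdf", "xyz"}
    parsed = {"docx", "pdf"}
    for ext in (".docx", ".xyz", ".abc"):
        print(ext, "->", index_coverage(ext, crawled, parsed))
```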

Plan to use (custom) entity extractors

You can configure the search system to look for "entities" in unstructured content, such as in the body text or the title of a document. These entities can be words or phrases, such as product names. To specify which entities to look for, you can create and deploy your own dictionaries.

The extracted entities are stored in the search index as separate managed properties, which are automatically configured to be searchable, queryable, retrievable, sortable and refinable. You can use those properties in search refiners, for example, to help users filter their search results.

For companies, you can use the pre-populated company extraction dictionary that SharePoint 2013 provides.

In addition, you can deploy several types of custom entity extractors in the form of custom entity extraction dictionaries. You deploy these dictionaries using Windows PowerShell. The entries in these dictionaries (single or multiple words) will be matched on words or parts of words in the content in a case-sensitive or case-insensitive way. For more information, see Create and deploy custom entity extractors in SharePoint Server 2013.

Custom entity extractor / dictionary | Description
Word Extraction | Case-insensitive, maximum 5 dictionaries. For example, the entry "anchor" would match "anchor" and "Anchor", but not "anchorage".
Word Part Extraction | Case-insensitive, maximum 5 dictionaries. For example, the entry "anchor" would match "anchor", "Anchor" and within "anchorage".
Word Exact Extraction | Case-sensitive, maximum 1 dictionary. For example, the entry "anchor" would match "anchor", but not "Anchor" or "Anchorage".
Word Part Exact Extraction | Case-sensitive, maximum 1 dictionary. For example, the entry "anchor" would match "anchor" and within "anchorage", but not "Anchor".
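The four matching behaviors can be compared side by side with a small model. The following Python sketch reproduces the "anchor" examples from the table by using regular expressions. It only illustrates the matching semantics: the mode names and the function are hypothetical, and the actual extractors rely on the search system's own tokenization, which this sketch does not attempt to reproduce.

```python
import re

# Hypothetical mode names mapped to (whole_word, case_sensitive).
MODES = {
    "word": (True, False),             # Word Extraction
    "word_part": (False, False),       # Word Part Extraction
    "word_exact": (True, True),        # Word Exact Extraction
    "word_part_exact": (False, True),  # Word Part Exact Extraction
}


def entry_matches(entry, text, mode):
    """Return True if the dictionary entry is found in text under mode."""
    whole_word, case_sensitive = MODES[mode]
    pattern = re.escape(entry)
    if whole_word:
        # \b approximates word boundaries; real tokenization may differ.
        pattern = r"\b" + pattern + r"\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(pattern, text, flags) is not None


if __name__ == "__main__":
    for mode in MODES:
        results = {
            text: entry_matches("anchor", text, mode)
            for text in ("anchor", "Anchor", "anchorage")
        }
        print(mode, results)
```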

About result sources and federated search

In SharePoint 2013, you use result sources to specify which providers to get search results from, which protocol to use to get those results and which authentication method to use to connect to the result source. You can also use query rules on a result source to narrow down a search to a subset of results from that particular result source. For more information, see Understanding result sources for search in SharePoint Server 2013.

Federated search is the concurrent querying of local and external search providers and displaying the results on a single results page for end-users. When you add a result source that specifies an external provider such as a remote search engine or feed, end-users can search for and retrieve content that has not been crawled by servers in the local farm.
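The idea of concurrent querying can be illustrated with a short model. The following Python sketch sends one query to a local provider and an external provider at the same time and keeps the results grouped by provider, as they might be shown in separate blocks on one results page. The provider functions are stand-ins that return placeholder strings. They are not SharePoint result sources or real search APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def local_index(query):
    # Stand-in for results from the index of crawled content.
    return [f"intranet result for {query}"]


def external_provider(query):
    # Stand-in for results from an external provider.
    return [f"external result for {query}"]


def federated_search(query, providers):
    """Query all providers concurrently.

    providers maps a provider name to a callable that takes the query
    and returns a list of results. Returns a dict of name -> results.
    """
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = {
            name: pool.submit(search, query)
            for name, search in providers.items()
        }
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    results = federated_search(
        "budget",
        {"local": local_index, "external": external_provider},
    )
    for name, items in results.items():
        print(name, items)
```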

Change History

Date | Description
May 14, 2013 | Added a section describing the reasons to do a full crawl and added more information about content sources for line of business data.
July 16, 2012 | Initial publication

© 2013 Microsoft. All rights reserved.