Plan for crawling and federation (FAST Search Server 2010 for SharePoint)

 

Applies to: FAST Search Server 2010

This article discusses how to plan for crawling and federation by helping you understand how Microsoft FAST Search Server 2010 for SharePoint federates, crawls and indexes content.

Before end-users can use the enterprise search functionality in FAST Search Server 2010 for SharePoint, you must crawl or federate the content that you want to make available for users to search.

This article distinguishes between three types of indexing connectors:

  1. The Microsoft SharePoint Server 2010 indexing connectors and crawling framework

    Most content sources can be crawled by using this framework, either through the integrated indexing connectors or through Business Connectivity Services. You use the SharePoint Server 2010 Central Administration for most configuration and operation tasks.

  2. Federated search connectors

    Federated search connectors enable you to pass a query to a target system and display results returned from that system without actually crawling that content. You use the SharePoint Server 2010 Central Administration for most configuration and operation tasks.

  3. The FAST Search Server 2010 for SharePoint specific indexing connectors

    FAST Search Server 2010 for SharePoint offers three additional indexing connectors to crawl web, database and Lotus Notes content. These indexing connectors are configured mainly by editing XML files and by using Windows PowerShell cmdlets, and you operate them from the command line.

In this article:

  • Identify content sources and determine which indexing connector to use

    This section helps you determine which indexing connector to use.

  • Plan to use indexing connectors integrated in SharePoint Server 2010

    This section helps you plan to use the SharePoint Server 2010 indexing connector framework and the FAST Search Content Search Service Application.

    • Plan content sources

    • Plan content sources for business data

    • Plan indexing connector protocols

    • Plan file-type inclusions and IFilters

    • Plan crawl schedules, crawl rules and manage the impact of crawling

    • Plan for authentication

  • Plan for federation

    This section helps you plan how to include federated search results for end-user queries using the SharePoint Server 2010 framework and the FAST Search Query Search Service Application.

  • Plan to use the FAST Search Server 2010 for SharePoint indexing connectors

    This section helps you plan to use the FAST Search Server 2010 for SharePoint specific connectors.

    • About the FAST Search Web crawler

    • About the FAST Search database connector

    • About the FAST Search Lotus Notes connector

    • Include or exclude content from a crawl

    • Set crawl schedules

Identify content sources and determine which indexing connector to use

FAST Search Server 2010 for SharePoint uses different indexing connectors for different content sources. The choice of indexing connector is influenced by the kind of content that you want to crawl, by personal preference and by specific needs of your organization.

Most content sources can be crawled using the various indexing connectors offered through Microsoft SharePoint Server 2010. In the Central Administration user interface, the collection of these indexing connectors is referred to as the FAST Search connector. This is not one separate indexing connector, but rather a collection of several indexing connectors. The FAST Search connector is associated with one or more content sources (and therefore indexing connectors) through the FAST Search Content Search Service Application (Content SSA). The Content SSA also connects the Microsoft SharePoint Server 2010 front-end with the FAST Search Server 2010 for SharePoint back-end.

When you install FAST Search Server 2010 for SharePoint, you also have access to three FAST Search Server 2010 for SharePoint specific indexing connectors. These connectors can feed web, database and Lotus Notes content to the index. The following overview summarizes the available indexing connectors and their recommended use cases, grouped by type of content.

  • SharePoint

    SharePoint indexing connector: Use in all use cases.

  • File shares

    File share indexing connector: Use in all use cases.

  • Exchange public folders

    Exchange indexing connector: Use in all use cases.

  • People profiles

    User profiles indexing connector: Use in all use cases. This kind of content is crawled through the FAST Search Query Search Service Application.

  • Web sites

    Web site indexing connector: Use when you have a limited number of Web sites to crawl, without dynamic content.

    FAST Search Web crawler:

      • Use when you have many Web sites to crawl.

      • Use when the Web site content contains dynamic data, including JavaScript.

      • Use when the organization needs access to advanced Web crawling, configuration and scheduling options.

      • Use when you want to crawl RSS Web content.

      • Use when the Web site content uses advanced logon options.

  • Database

    Business Data Catalog-based indexing connectors:

      • Use when the preferred configuration method is Microsoft SharePoint Designer 2010.

      • Use when you want to use time stamp based change detection for incremental database crawls.

      • Use when the preferred operation method is the Microsoft SharePoint Server 2010 Central Administration.

      • Use when you want to enable crawling based on the change log. This can be achieved by directly modifying the connector model file and creating a stored procedure in the database.

    FAST Search database connector:

      • Use when the preferred configuration method is SQL queries.

      • Use when you want advanced data joining options through SQL queries.

      • Use when you want to use advanced incremental update features. The FAST Search database connector uses checksum based change detection for incremental crawls if no update information is available. The connector also supports time stamp based change detection and change detection based on update and delete flags.

  • Lotus Notes

    Lotus Notes indexing connector: Use when the preferred operation method is the Microsoft SharePoint Server 2010 Central Administration.

    FAST Search Lotus Notes connector:

      • Use when full Lotus Notes security support is required, including support for Lotus Notes roles.

      • Use when you want to crawl Lotus Notes databases as attachments.

  • Line of Business Data

    Business Data Catalog-based indexing connectors:

      • Use when your content source contains data in line-of-business applications.

      • Use when you want to enable crawling based on the change log. This can be achieved by directly modifying the connector model file and creating a stored procedure in the database.

About crawling and indexing content

The result of successfully crawling content is that the individual files or pieces of content that you want to make available to search queries are accessed and read by the indexing connector. By crawling the content, you create a set of crawled properties for those items. These crawled properties are mapped to managed properties that are stored in the search index, also known as the index.

Note

The indexing connectors do not change the files on the host servers; the files are only accessed and read, not modified. However, on some servers that host certain content sources, the last accessed date of files that have been crawled may be updated when the indexing connector reads the content.
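
If you prefer scripting over Central Administration, the crawled-to-managed property mapping can also be inspected and extended with the FAST Search Server 2010 for SharePoint cmdlets. The following is a minimal sketch, assuming a hypothetical crawled property named ProjectCode; the property names and the text type code are illustrative, not prescriptive.

    # Minimal sketch: expose a crawled property as a searchable managed property.
    # Run in the Microsoft FAST Search Server 2010 for SharePoint shell on the
    # FAST Search administration server. "ProjectCode" is a hypothetical crawled
    # property emitted by an indexing connector.

    # Look up the crawled property that was created during crawling.
    # (This may return more than one match if the name exists in several categories.)
    $crawledProp = Get-FASTSearchMetadataCrawledProperty -Name "ProjectCode"

    # Create a managed property of type text (type code 1) to hold the values.
    $managedProp = New-FASTSearchMetadataManagedProperty -Name "projectcode" -Type 1

    # Map the crawled property to the managed property so that subsequent crawls
    # populate it in the index.
    New-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $managedProp -CrawledProperty $crawledProp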

Plan to use indexing connectors integrated in SharePoint Server 2010

Most content sources can be crawled using the indexing connectors integrated in SharePoint Server 2010. You use the SharePoint Server 2010 Central Administration for most configuration and operation tasks.

These indexing connectors are set up by configuring the FAST Search Content Search Service Application (Content SSA). Among other things, the Content SSA enables communication with the FAST Search Server 2010 for SharePoint back-end. Within the Content SSA you specify the location of the content sources, the crawl schedule and other information. The Content SSA feeds content to the default content collection, which is named sp.
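
As an illustration, you can retrieve the Content SSA and review its content sources from the SharePoint 2010 Management Shell. This is a minimal sketch; the service application name "FAST Content SSA" is a placeholder for whatever name you gave your Content SSA.

    # Minimal sketch, run in the SharePoint 2010 Management Shell.
    # "FAST Content SSA" is a placeholder for the name of your Content SSA.
    $contentSsa = Get-SPEnterpriseSearchServiceApplication -Identity "FAST Content SSA"

    # List the content sources that this Content SSA is configured to crawl.
    Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $contentSsa |
        Select-Object Name, Type, StartAddresses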

The FAST Search connector crawls:

  • SharePoint sites

  • Web sites

  • File shares that contain content such as Microsoft Office documents

  • Exchange public folders

  • Line of business data, for example content from databases

  • Custom repositories, accessed with a custom-built connector

Plan content sources

A content source in the FAST Search Content Search Service Application (Content SSA) is a set of options that you can use to specify what kind of content is crawled, what URLs to crawl, and how deep and when to crawl. The default content source is Local SharePoint sites. You can use this content source to specify how to crawl all content in all web applications that are associated with a particular Content SSA. By default, for each web application that uses a particular Content SSA, FAST Search Server 2010 for SharePoint adds the start address of the top-level site of each site collection to the default content source.

Some organizations can use the default content source to satisfy their search requirements. However, many organizations need additional content sources. Plan additional content sources when you have to do the following:

  • Crawl different kinds of content — for example, SharePoint sites, file shares, and business data.

  • Crawl some content on different schedules than other content.

  • Limit or increase the quantity of content that is crawled.

  • Set different priorities for crawling different sites.

You can create up to 500 content sources in the Content SSA, and each content source can contain as many as 500 start addresses. To keep administration as simple as possible, we recommend that you limit the number of content sources that you create.
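
If you script the configuration instead of using Central Administration, additional content sources can be created as in the following sketch. The names and start addresses are hypothetical, and the Content SSA name is assumed to be "FAST Content SSA".

    # Minimal sketch, run in the SharePoint 2010 Management Shell.
    # Names and start addresses are placeholders for illustration.
    $contentSsa = Get-SPEnterpriseSearchServiceApplication -Identity "FAST Content SSA"

    # A content source contains only one kind of content, so SharePoint sites and
    # file shares go into two separate content sources.
    New-SPEnterpriseSearchCrawlContentSource -SearchApplication $contentSsa -Name "Intranet sites" -Type SharePoint -StartAddresses "http://intranet.contoso.com"

    New-SPEnterpriseSearchCrawlContentSource -SearchApplication $contentSsa -Name "Marketing file share" -Type File -StartAddresses "\\fileserver\marketing"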

Plan to crawl different kinds of content

You can only crawl one kind of content per content source. That is, you can create a content source that contains start addresses for SharePoint sites and another content source that contains start addresses for file shares. However, you cannot create a single content source that contains start addresses to both SharePoint sites and file shares. The following list shows the kinds of content source that you can configure and the content that each kind is used for.

  • SharePoint sites

      • SharePoint sites from the same farm or different Microsoft SharePoint Server 2010, Microsoft SharePoint Foundation 2010, or Microsoft Search Server 2010 farms

      • SharePoint sites from the same farm or different Microsoft Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Microsoft Search Server 2008 farms

      • SharePoint sites from Microsoft Office SharePoint Portal Server 2003 or Windows SharePoint Services 2.0 farms

    Note

    Unlike crawling SharePoint sites on SharePoint Server 2010, SharePoint Foundation 2010, or Search Server 2010, the SharePoint Server 2010 crawler cannot automatically crawl all subsites in a site collection from previous versions of SharePoint Products and Technologies. Therefore, when crawling SharePoint sites from previous versions, you must specify the start address of each top-level site and the URL of each subsite that you want to crawl.

  • Web sites

      • Other Web content in your organization that is not located in SharePoint sites

      • Content on Web sites on the Internet

  • File shares

    Content on file shares in your organization

  • Exchange public folders

    Microsoft Exchange Server content

  • Lotus Notes

    E-mail messages stored in Lotus Notes databases

    Note

    Unlike all other kinds of content sources, the Lotus Notes content source option does not appear in the user interface until you have installed and configured the appropriate prerequisite software. For more information, see Configure and use the Lotus Notes connector (FAST Search Server 2010 for SharePoint).

  • Business data

    Business data that is stored in line-of-business applications

Plan content sources for business data

Business data content sources require that the applications hosting that data are specified in an Application Model in a Business Data Connectivity service application. You can create one content source to crawl all applications that are registered in the Business Data Connectivity service, or you can create separate content sources to crawl individual applications.

Often, the people who plan for integration of business data into site collections are not the same people involved in the overall content planning process. Therefore, include business application administrators in content planning teams so that they can advise you how to integrate the business application data into content and effectively present it in the site collections.

About Business Connectivity Services models

To crawl certain repositories, for example databases or Web services, you need the SharePoint Server 2010 search connector framework. This framework enables you to use a Business Connectivity Service (BCS) model to crawl external data sources. These model files define the connection details and structure of the external content source that you plan to crawl. The BCS models are imported into the Business Connectivity Service. You will point to this model and service when you set up the Line of Business Data or Custom Repository type content source.

There are several predesigned BCS models you can use for database content and Web services (WCF). You can also create your own custom BCS model and custom connector by using the connector framework.

To build on the SharePoint Server 2010 search connector framework, you must use either SharePoint Designer or Microsoft Visual Studio 2010, depending on your specific requirements and goals.

Use SharePoint Designer to:

  • Create BCS models that are needed to crawl out-of-the-box supported external content sources, such as databases and Web services.

  • Import/export models between BCS applications

Use Microsoft Visual Studio to:

  • Implement methods for .NET BCS Connector

  • Write a custom connector for your repository

Multiple content sources can all pull from the same Business Connectivity Service (BCS), and you can point different Search Service Applications to the same model in a shared BCS.

For more information about the SharePoint Server 2010 connector framework, Business Connectivity Services and creating custom connectors, refer to SharePoint Server Search Connector Framework (MSDN).

Plan indexing connector protocols

All content that is crawled requires you to use a connector to gain access to that content. FAST Search Server 2010 for SharePoint (through the SharePoint Server 2010 connector framework) provides connectors for all common Internet protocols. However, if you want to crawl content that requires a connector that is not installed with SharePoint Server 2010, you must install the third-party or custom connector before you can crawl that content. For a list of indexing connector protocols that are installed by default, see Default indexing connector protocols (FAST Search Server 2010 for SharePoint). You can install additional indexing connectors and protocols to crawl content created by other Microsoft products or third-party software. For more information, see Content sources that require additional configuration (FAST Search Server 2010 for SharePoint).

Plan file-type inclusions and IFilters

FAST Search Server 2010 for SharePoint crawls and extracts metadata and content from most common file types. Several file types and IFilters are included automatically during initial installation. When you plan for content sources in your initial deployment, determine whether content that you want to crawl uses file types that are not included. If file types are not included, you must add those file types either by enabling the Advanced Filter Pack or by installing and registering a third-party IFilter to support that file type.

If you want to exclude certain file types from being crawled, you can add the file name extension for that file type to the file type exclusions list. Doing so excludes file names that have that extension from being crawled. For a list of file types and IFilters that are supported or excluded by default, see IFilter and file type reference (FAST Search Server 2010 for SharePoint).
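
File-type inclusions can also be reviewed and extended from the SharePoint 2010 Management Shell. A minimal sketch, assuming a Content SSA named "FAST Content SSA" and using the OpenDocument text extension as an arbitrary example:

    # Minimal sketch, run in the SharePoint 2010 Management Shell.
    $contentSsa = Get-SPEnterpriseSearchServiceApplication -Identity "FAST Content SSA"

    # List the file name extensions that are currently included in crawls.
    Get-SPEnterpriseSearchCrawlExtension -SearchApplication $contentSsa

    # Add a file type to the inclusions list. An IFilter (or the Advanced Filter
    # Pack) must still be able to parse the format for its content to be extracted.
    New-SPEnterpriseSearchCrawlExtension -SearchApplication $contentSsa -Name "odt"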

Plan crawl schedules, crawl rules and manage the impact of crawling

Several factors determine whether and how many crawl schedules and crawl rules to plan for and the extent to which you have to manage the impact of crawling when you use the Content SSA and the SharePoint crawler to crawl content.

Note

Before you can start incremental crawls of one or more content sources, the system must first complete a full crawl.
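
For example, a full crawl of a single content source can be started from the SharePoint 2010 Management Shell as sketched below; the service application and content source names are examples.

    # Minimal sketch, run in the SharePoint 2010 Management Shell.
    $contentSsa = Get-SPEnterpriseSearchServiceApplication -Identity "FAST Content SSA"

    # Start a full crawl of one content source. Incremental crawls can be
    # scheduled after this full crawl has completed.
    $cs = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $contentSsa -Identity "Intranet sites"
    $cs.StartFullCrawl()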

Example reasons that require you to configure crawl schedules, crawl rules and/or manage the impact of crawling are:

  • To accommodate down times and periods of peak usage.

  • To crawl frequently updated content more often than other content.

  • To crawl content that is located on slower servers separately from content that is located on other servers.

  • To exclude content from crawls that is likely to be less relevant.

  • To decrease or increase the request frequency for particular (external) Web sites or content servers.

  • To crawl content with a different account than the default content access account.

For more detailed information and additional considerations, refer to the relevant sections in the SharePoint Server 2010 topic Plan for crawling and federation (SharePoint Server 2010). For more information about crawling content on different schedules, see the sections Considerations for planning crawl schedules and Reasons to do a full crawl. Read the section Plan to manage the impact of crawling if you want to learn more about crawler impact rules.
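
As one example of what a scripted crawl schedule can look like, the following sketch sets a daily incremental crawl for one content source. The exact scheduling parameters of Set-SPEnterpriseSearchCrawlContentSource may differ in your environment, so treat this as a starting point rather than a definitive reference.

    # Minimal sketch, run in the SharePoint 2010 Management Shell.
    # Schedules an incremental crawl every day at 03:00 for one content source.
    $contentSsa = Get-SPEnterpriseSearchServiceApplication -Identity "FAST Content SSA"
    $cs = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $contentSsa -Identity "Intranet sites"

    Set-SPEnterpriseSearchCrawlContentSource -Identity $cs -ScheduleType Incremental -DailyCrawlSchedule -CrawlScheduleRunEveryInterval 1 -CrawlScheduleStartDateTime "03:00"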

Plan for authentication

When the SharePoint Server 2010 crawler accesses the start addresses that are listed in content sources, the SharePoint Server 2010 crawler must be authenticated by, and granted access to, the servers that host that content. This means that the domain account that is used by the SharePoint Server 2010 crawler must have at least read permissions on the content.

By default, the system uses the default content access account. Or, you can use crawl rules in the FAST Search Content SSA to specify a different content access account to use when crawling particular content. Whether you use the default content access account or a different content access account specified by a crawl rule, the content access account that you use must have read permissions on all content that is crawled. If the content access account does not have read permissions, the content is not crawled, is not indexed, and therefore not available to queries.
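
For instance, a crawl rule that makes the crawler authenticate with a dedicated account for one content location could be created as in the following sketch; the server path, domain and account name are hypothetical.

    # Minimal sketch, run in the SharePoint 2010 Management Shell.
    $contentSsa = Get-SPEnterpriseSearchServiceApplication -Identity "FAST Content SSA"

    # Prompt for the password of the dedicated content access account.
    $password = Read-Host -Prompt "Password for CONTOSO\svc-hrcrawl" -AsSecureString

    # Crawl everything under the HR server with an account that has read
    # permissions on that content, instead of the default content access account.
    New-SPEnterpriseSearchCrawlRule -SearchApplication $contentSsa -Path "http://hr.contoso.com/*" -Type InclusionRule -AuthenticationType NTLMAccountRuleAccess -AccountName "CONTOSO\svc-hrcrawl" -AccountPassword $password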

For more information, see the section Plan for authentication in the SharePoint Server 2010 topic Plan for crawling and federation.

Plan for federation

Federated search is the concurrent querying of multiple web resources or databases to generate a single search results page for end-users. In FAST Search Server 2010 for SharePoint, you configure federated locations in the FAST Search Query Search Service Application so that end-users can search for and retrieve content that has not been crawled by servers in the local system. Federated locations enable queries to be sent to remote search engines and feeds. Accordingly, the system renders the results to end-users as if the federated content were part of the crawled content.

FAST Search Server 2010 for SharePoint, through SharePoint Server 2010, supports the following kinds of federated locations:

  • Search index on this server. You can use a local index in your organization that has a server that is running SharePoint Server 2010 as a federated location. For example, imagine that a SharePoint site on a Human Resources server in your company is the only available source of employee contact information. Even if the site is not part of your crawl scope, you can configure a federated location for it so that users who start a search from your Search Center site can retrieve employee contact information results that they are authorized to see. The following conditions apply:

    1. The location is set to Search index on this server.

    2. No query template is required. SharePoint Server 2010 uses the object model to query a location.

    3. Default server authentication is used.

    4. Advanced search queries are not supported.

  • FAST Search index.

    Use this option if you want to federate results from a local FAST Search Server 2010 for SharePoint index into a Search Center or a FAST Search Center.

  • OpenSearch 1.0 or 1.1. You can use any public Web site that supports the OpenSearch standard as a federated location. An example of such a location is an Internet search engine such as Bing, or a search results page that supports RSS or Atom protocols. For example, imagine that you want users who search your internal sites for proprietary technical research to also see related research information from public Web sites. By configuring a federated location for a Bing search query, web search results will be automatically included for users. The following conditions apply:

    1. Queries can be sent to a search engine as a URL, such as http://www.example.com/search.aspx?q=TEST (a minimal sketch of this request pattern follows this list).

    2. Search results are returned in RSS, Atom, or another structured XML format.

    3. Location capabilities, query templates, and response elements are part of an OpenSearch description (.osdx) file that is associated with the location.

    4. Extensions to OpenSearch that are specific to FAST Search Server 2010 for SharePoint enable you to include triggers and the ability to associate XSL code with search results.

    5. The choice of metadata to display in the search results is determined by the OpenSearch location.

    For more information about OpenSearch, visit http://www.opensearch.org.
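
To make the OpenSearch request/response pattern concrete, the following sketch substitutes a query into an OpenSearch-style URL template and reads the RSS results back. The endpoint is a placeholder, and the sketch only illustrates the protocol pattern, not how SharePoint Server 2010 issues federated queries internally.

    # Minimal sketch of the OpenSearch request/response pattern.
    # The endpoint URL is a placeholder; {searchTerms} is the standard OpenSearch
    # placeholder that is replaced with the end-user's query.
    $template = "http://www.example.com/search.aspx?q={searchTerms}&format=rss"
    $query = "proprietary technical research"
    $url = $template -replace "\{searchTerms\}", [System.Uri]::EscapeDataString($query)

    # Download the RSS response and list the result titles and links.
    $client = New-Object System.Net.WebClient
    [xml]$rss = $client.DownloadString($url)
    $rss.rss.channel.item | ForEach-Object { "$($_.title): $($_.link)" }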

For more information about federation and about planning authentication types for federation, consult the section Plan for federation in the SharePoint Server 2010 topic Plan for crawling and federation.

An overview of federated search connectors that you can use to import a federated location can be found on the Federated Search Connector Gallery on the Enterprise Search Tech Center.

Plan to use the FAST Search Server 2010 for SharePoint indexing connectors

In addition to the indexing connectors integrated in Microsoft SharePoint Server 2010, FAST Search Server 2010 for SharePoint offers additional content indexing connectors for web, Lotus Notes and database content.

These indexing connectors are configured mainly by editing XML files and by using Windows PowerShell cmdlets, and you operate them from the command line.

About the FAST Search Web crawler

The FAST Search Web crawler is a customizable indexing connector used to crawl Web site content. The FAST Search Web crawler can scale up to large environments, for example when your organization is crawling many external Web sites. In addition, the FAST Search Web crawler can crawl dynamic web content, such as Web sites that contain JavaScript.

The FAST Search Web crawler collects content from a set of defined Web sites, which can be internal or external. The configuration of the FAST Search Web crawler is done by editing a copy of an XML file. You can operate the FAST Search Web crawler through several command line tools.

The FAST Search Web crawler is typically a component inside a FAST Search Server 2010 for SharePoint installation. Internally, the FAST Search Web crawler is organized as a collection of processes and logical entities, which in most cases run on a single server. When the number of Web sites or total number of pages to be crawled is large, the FAST Search Web crawler can be scaled up by distributing these processes across multiple hosts. This requires additional configuration.

The FAST Search Web crawler can crawl HTTP, HTTPS and FTP content and supports NTLM version 1 (and, to a limited extent, version 2), Digest, Basic and form-based logon authentication. RSS scheduling is supported and you can tag linked documents from the feed.

About the FAST Search database connector

The FAST Search database connector is a specialized indexing connector that collects content from database content sources.

The indexing connector is configured by using an XML template. You operate the connector by using the command-line options from the jdbcconnector.bat file. After running the configured connector, you map crawled properties to managed properties in the SharePoint Server 2010 Central Administration to enable and customize search on the content collected by the connector.

The connector runs a SQL statement against the database that you want to crawl, and this statement is completely customizable. The FAST Search database connector uses checksum based change detection for incremental crawls if there is no update information available. The connector also supports time stamp based change detection and change detection based on update and delete flags. You can also specify procedures to run against the database before and after it is crawled, which can be an advantage in certain use cases.

About the FAST Search Lotus Notes connector

The FAST Search Lotus Notes connector is a specialized indexing connector that consists of two parts: a user directory connector and a content connector. The content connector collects content from a Lotus Notes content source. The user directory connector ensures that end-users can only search Lotus Notes content that they have access to; it maps Active Directory users to the corresponding Lotus Notes user accounts and is closely integrated with FAST Search Authorization.

The connector is configured by using two XML templates, one for the user directory connector and one for the content connector. You operate the connector by using the command-line options from the lotusnotesconnector.bat and lotusnotessecurity.bat files. After running the configured content connector, you map crawled properties to managed properties in the SharePoint Server 2010 Central Administration to enable and customize search on the content collected by the content connector.

The FAST Search Lotus Notes connector supports Lotus Notes versions 6.5.6, 7.x and 8.x and Lotus Domino versions 6.5, 7.x and 8.x.

The connector fully supports Lotus Notes security, including roles, and can index Lotus Notes databases as attachments.

Include or exclude content from a crawl

The FAST Search Server 2010 for SharePoint specific connectors each have parameters in their respective configuration files to indicate include and exclude rules.

Important

Do not overload any content sources that you plan to crawl.

For content within your organization that other administrators are crawling, you can coordinate with those administrators to set impact rules based on the performance and capacity of the servers. For most external sites, this coordination is not possible. Requesting too much content on external servers or making requests too frequently can cause administrators of those sites to limit your future access if your crawls use too many resources or too much bandwidth. Therefore, the best practice is to crawl more slowly. In this manner, you can reduce the risk of losing access to crawl the relevant content.

With the FAST Search Web crawler, you can control the crawl rate by setting a request delay, by limiting the number of concurrent requests that are sent to the same Web site, or by enabling or disabling concurrent crawling of an IP address that hosts multiple sites. You can also limit the bandwidth that the FAST Search Web crawler uses by limiting the number of Web sites that are crawled at the same time.

Set crawl schedules

The FAST Search Lotus Notes connector and the FAST Search database connector use the Windows Task Scheduler to schedule crawls. Scheduling crawls for the FAST Search Web crawler is possible by setting parameters in the XML configuration file.

Tip

It is recommended to complete a manually started full crawl cycle before scheduling additional crawls. This helps you find out how long a full crawl takes and avoids starting a new full or incremental crawl before the initial crawl has finished.
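
As an illustration of scheduling with the Windows Task Scheduler, the following sketch registers a nightly task that runs the FAST Search database connector. The installation path and the jdbcconnector.bat arguments are assumptions; check the connector's command-line help for the exact syntax in your installation.

    # Minimal sketch: register a nightly run of the FAST Search database connector
    # with the built-in schtasks.exe tool. Paths and arguments are placeholders.
    $command = 'C:\FASTSearch\bin\jdbcconnector.bat start -f C:\FASTSearch\etc\mydbconfig.xml'

    schtasks /Create /TN "FAST database connector crawl" /SC DAILY /ST 02:00 /TR $command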

See Also

Concepts

Plan to deploy FAST Search specific connectors (FAST Search Server 2010 for SharePoint)