Plan to deploy FAST Search specific connectors (FAST Search Server 2010 for SharePoint)

 

Applies to: FAST Search Server 2010

Before you start crawling content with the FAST Search specific connectors for Lotus Notes, database, or Web content, there are several considerations to take into account. By first identifying your search needs and the organization of the content sources you plan to crawl, and by collecting important information up front, you can set up the connectors more efficiently.

Consider creating several content collections

FAST Search Server 2010 for SharePoint uses a content collection to feed crawled content to the FAST Search Server 2010 for SharePoint index. You can choose to feed all the content to the same content collection, which defaults to sp.

However, it is sometimes necessary to remove the content that was fed by a particular FAST Search specific connector configuration. This requires that the whole content collection is cleared, which is done by running a Windows PowerShell command that clears all content from that content collection. You then have to re-crawl all the content sources that feed to that content collection, which can be time-consuming.

Consider creating separate content collections for your FAST Search specific connector configurations if you expect that the content from a particular configuration will have to be deleted at some point in the future. If you create a separate content collection for each (or some) of the FAST Search specific connector configurations, you can clear one collection without losing other content from the index, and you avoid having to re-crawl any other content sources that would otherwise feed to the same collection.

Important

Each additional content collection uses memory. Plan for approximately 500 MB of additional memory per content collection on each running indexer node. For example, in an environment with two indexers and two content collections, the total memory consumption is 2 × (2 × 500 MB) = 2,000 MB, or about 2 GB, divided across the indexers (1,000 MB on each).
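The arithmetic in this note can be sketched as a small planning helper. This is only an illustration of the rule of thumb above; the function name and the returned keys are hypothetical and are not part of FAST Search Server 2010 for SharePoint.

```python
def collection_memory_mb(indexer_nodes: int,
                         content_collections: int,
                         mb_per_collection: int = 500) -> dict:
    """Estimate the memory used by content collections.

    Applies the rule of thumb of about 500 MB per content collection
    per running indexer node. All names here are illustrative.
    """
    per_indexer = content_collections * mb_per_collection
    return {
        "per_indexer_mb": per_indexer,
        "total_mb": indexer_nodes * per_indexer,
    }


# Two indexers and two content collections, as in the example above.
estimate = collection_memory_mb(indexer_nodes=2, content_collections=2)
print(estimate)  # {'per_indexer_mb': 1000, 'total_mb': 2000}
```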

Before implementing the FAST Search Lotus Notes connector, consider:

  • Which Domino servers hold data to be crawled and indexed.

  • Which database(s) on each server should be crawled and indexed.

  • Which database view (if any) should be used to select which Lotus Notes documents are to be made searchable.

  • Which attachment types (for example: pdf, txt, doc, nsf, ppt) should or should not be made searchable.

  • Which accounts can be used to access the data with sufficient permissions.

  • What database metadata should be made searchable.

  • What the total number of Lotus Notes documents to be crawled and indexed is.

    • The total number of Lotus Notes documents to be crawled affects the type of SQL Server database that should be used to track the state data for the connector. The maximum size for a SQL Server Express database is about 4 GB, which corresponds to approximately 2 million Lotus Notes items. If you plan to crawl more items than that, use SQL Server 2008 Enterprise (or newer).

    • Instructions on how to change to a different state tracking database can be found in Configure the FAST Search Lotus Notes connector.

  • How often the connector should run, for example daily, weekly or monthly.

    • Different types of Lotus Notes content may have different demands for freshness. For example, e-mail databases may have to be crawled daily whereas archive databases can be crawled on a weekly (or even monthly) basis. In that case, you should create separate configurations for each of these content sets.

  • Whether several separate connector configurations must be created. There can be various reasons to split the crawl into two or more configurations and then schedule each configuration separately:

    • Different demands for freshness.

    • Different include/exclude rules for different content. For example, you might exclude Lotus Notes databases attached to e-mails but index Lotus Notes databases attached to project databases.

    • Different property mappings for different parts of the content. For example, the property “last modified” in e-mails may be a time stamp indicating WHEN the item was last modified, while the property “last modified” in a project database may be a string indicating WHO last modified the document. These two properties cannot be indexed in the same managed property because they are of different types, so each needs its own property mapping. You achieve this by creating separate configurations.

    • The need to feed different content to separate content collections in the search engine. If it is likely that you will delete one configuration (and the corresponding contents) sometime in the future, use a separate content collection for that configuration. This lets you delete all the indexed content for that configuration by deleting the content collection, without losing any other crawl data for different configurations. Refer to the section Considerations when modifying the filters in the article Manage crawl rules (FAST Search Lotus Notes connector) to decide whether you want to use separate content collections or not.

  • Whether you want to use item status tracking or not.

    • The item status tracking feature monitors when each item was last crawled and the status of that crawl, including any error messages. This information is stored in a database table that is not emptied automatically. Depending on the number of items in each crawl and how often a crawl is run, this table can become quite big.

      You can choose to turn off item status tracking by setting the FAST Search Lotus Notes connector configuration file parameter ConnectorExecution/EnableStatusTracker to false. Alternatively, you can manually delete the contents from the connectors.statustracker table at regular intervals when the connector is not running.
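The sizing guidance for the state tracking database in the list above can be turned into a quick planning check. The following is a minimal sketch that only encodes the threshold stated above (about 4 GB, or approximately 2 million Lotus Notes items, for SQL Server Express). The function name and the growth_factor parameter are assumptions added for illustration and are not part of the connector.

```python
# Threshold taken from the guidance above: a SQL Server Express
# database of about 4 GB holds state for roughly 2 million items.
EXPRESS_LIMIT_ITEMS = 2_000_000


def recommend_state_database(total_items: int,
                             growth_factor: float = 1.0) -> str:
    """Suggest a state tracking database edition for a planned crawl.

    growth_factor is a hypothetical allowance for content growth,
    for example 1.5 to plan for 50 percent more items over time.
    """
    projected_items = int(total_items * growth_factor)
    if projected_items <= EXPRESS_LIMIT_ITEMS:
        return "SQL Server Express"
    return "SQL Server 2008 Enterprise (or newer)"


print(recommend_state_database(1_500_000))
print(recommend_state_database(1_500_000, growth_factor=1.5))
```

Planning with a growth allowance helps you avoid having to change the state tracking database after the connector is already in production.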

Before implementing the FAST Search database connector, consider:

  • Which servers contain databases with data to be crawled and indexed.

  • Which databases/tables hold data to be crawled and indexed.

  • Be prepared to formulate (or have a database owner formulate) an SQL query for each data source that is to be crawled and indexed. Database client tools can help you write and test these queries.

  • For each SQL query, determine which column(s) should serve as the document identifier.

  • Determine whether the SQL query will return several rows with the same identifier that should be merged into one FAST Search database connector document.

  • How incremental updates should be configured. Find out whether the database includes a time stamp column that can be used to find updated rows, or whether such a column can be added. If no such column is present and you cannot add one, you can use a checksum-based or flag-based approach.

  • Whether you want to use item status tracking or not.

    • The item status tracking feature monitors when each item was last crawled and the status of that crawl, including any error messages. This information is stored in a database table that is not emptied automatically. Depending on the number of items in each crawl and how often a crawl is run, this table can become quite big.

      You can choose to turn off item status tracking by setting the FAST Search database connector configuration file parameter ConnectorExecution/EnableStatusTracker to false. Alternatively, you can manually delete the contents from the connectors.statustracker table at regular intervals when the connector is not running.
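To make the checksum-based approach to incremental updates mentioned in the list above more concrete, the following sketch shows the general idea: compute a checksum for each row and compare it with the checksum stored from the previous crawl. This illustrates the technique only and is not how the FAST Search database connector is implemented. The function names, the use of SHA-256, and the row format (a dictionary per row) are all assumptions.

```python
import hashlib


def row_checksum(row: dict) -> str:
    """Return a checksum over all columns of a row (illustrative)."""
    # Sort the column names so the checksum does not depend on column order.
    canonical = "|".join(f"{name}={row[name]}" for name in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(rows, id_column, previous):
    """Compare rows against the checksums stored by the previous crawl.

    previous maps document identifier -> checksum. Returns the
    identifiers that are new or changed, the identifiers that have
    disappeared, and the checksum map to store for the next crawl.
    """
    current = {}
    changed = []
    for row in rows:
        doc_id = str(row[id_column])
        checksum = row_checksum(row)
        current[doc_id] = checksum
        if previous.get(doc_id) != checksum:
            changed.append(doc_id)
    deleted = [doc_id for doc_id in previous if doc_id not in current]
    return changed, deleted, current


# First crawl: every row is new.
rows = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}]
changed, deleted, state = detect_changes(rows, "id", {})
print(changed, deleted)  # ['1', '2'] []

# Second crawl: row 1 was modified and row 2 was removed.
rows = [{"id": 1, "title": "a2"}]
changed, deleted, state = detect_changes(rows, "id", state)
print(changed, deleted)  # ['1'] ['2']
```

A time stamp column is usually cheaper than this approach, because the checksum comparison requires reading every row on every crawl.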

Before implementing the FAST Search Web crawler, consider:

  • How many crawler configurations you intend to use. Consider using multiple FAST Search Web crawler configurations when you have multiple independent Web sites to crawl that require different crawl configuration rules. Note that in many cases it may be possible to use sub collections instead.

  • How many FAST Search Web crawler servers you intend to use. Consider using multiple FAST Search Web crawler servers (known as a multi-node Web crawler) when you have a large number of Web sites to crawl. Note that a single DNS domain (for example *.contoso.com) can only be crawled by a single FAST Search Web crawler server.

  • Which Web sites hold the data to be crawled and indexed.

  • Think about the number of Web sites to be crawled, how quickly content should be requested and how often the Web sites should be re-visited.

  • Which Web sites or parts of Web sites should not be indexed. Examples are CVS repository front ends or other systems with a Web front end, which may pollute the index with unwanted data or put a load on the Web server that it is not scaled to handle.

  • Which Web sites require authentication, what kind of authentication is required, and what accounts can be used to access the data with sufficient permissions.
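When you plan a multi-node Web crawler, the constraint noted above (a single DNS domain can only be crawled by a single FAST Search Web crawler server) means that all hosts in one domain end up on the same server. The following planning sketch groups hosts by domain to show how uneven the resulting load could be. It does not reproduce how the Web crawler itself distributes sites. The server names, the function names, and the simplification that a DNS domain is the last two labels of the host name are all assumptions.

```python
import zlib


def dns_domain(host: str) -> str:
    """Return the last two labels of a host name (a simplification)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def assign_sites(hosts, crawler_servers):
    """Assign each host to one crawler server, keeping domains together."""
    assignment = {server: [] for server in crawler_servers}
    for host in hosts:
        domain = dns_domain(host)
        # A stable hash of the domain, so every host in the same
        # domain maps to the same server on every run.
        index = zlib.crc32(domain.encode("utf-8")) % len(crawler_servers)
        assignment[crawler_servers[index]].append(host)
    return assignment


hosts = [
    "www.contoso.com",
    "intranet.contoso.com",
    "www.fabrikam.com",
    "docs.adatum.com",
]
# Hypothetical server names.
for server, sites in assign_sites(hosts, ["crawler1", "crawler2"]).items():
    print(server, sites)
```

If one domain accounts for most of the content, adding more Web crawler servers will not spread that load, which is worth knowing before you size the deployment.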