Manage crawl rules (FAST Search Lotus Notes connector)

 

Applies to: FAST Search Server 2010

The FAST Search Lotus Notes connector offers a number of ways to specify which Domino content to crawl.

Available filtering mechanisms

To add filters for which databases to crawl, you can:

  • Explicitly list one or more databases to crawl;

  • Specify one or more regular expressions to include databases in search;

  • Specify one or more regular expressions to exclude databases from search.
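The interaction between these three mechanisms can be sketched in a few lines of Python. This is only an illustrative model, not the connector's implementation: the function name `select_databases` and the sample paths are hypothetical, Python's `re` dialect and unanchored matching stand in for whatever regular expression engine the connector uses, and the behavior when only an exclude filter is set is an assumption. The one rule taken directly from the parameter descriptions is that an exclude match overrides an include match.

```python
import re


def select_databases(available, database="", include="", exclude=""):
    """Return the database paths this simplified model would crawl."""
    def split(value):
        return [part.strip() for part in value.split(";") if part.strip()]

    explicit = split(database)
    if explicit:
        # An explicit list wins; a '#view' suffix is not part of the path.
        wanted = {entry.split("#", 1)[0] for entry in explicit}
        return [path for path in available if path in wanted]

    include_patterns = [re.compile(p) for p in split(include)]
    exclude_patterns = [re.compile(p) for p in split(exclude)]

    selected = []
    for path in available:
        included = any(p.search(path) for p in include_patterns)
        excluded = any(p.search(path) for p in exclude_patterns)
        # Exclude overrides include, so it can only narrow the selection.
        if included and not excluded:
            selected.append(path)
    return selected


# Hypothetical database paths on a Domino server.
available = ["mail/alice.nsf", "mail/bob.nsf", "apps/crm.nsf", "mail/archive/old.nsf"]
print(select_databases(available, include=r"^mail/", exclude=r"archive"))
```

In this model the include filter selects the three mail databases and the exclude filter then removes the archived one.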

To add filters to select which Lotus Notes documents to crawl, you can:

  • Specify a database view to use the search formula of that view to select which Lotus Notes documents in the database to crawl;

  • Specify a search formula to select which Lotus Notes documents in the database to crawl.
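A view is attached to an individual database by appending it to the database path in the DataBase parameter, separated by a #, with entries separated by semicolons. The following Python sketch shows how such a value breaks down into path and view pairs. The function name `parse_database_parameter` and the sample paths and view name are hypothetical, and trimming whitespace around entries is an assumption rather than documented connector behavior.

```python
def parse_database_parameter(value):
    """Split a DataBase value into (database_path, view_or_None) pairs."""
    entries = []
    for raw in value.split(";"):
        raw = raw.strip()  # assumption: surrounding whitespace is not significant
        if not raw:
            continue
        # Everything after the first '#' is treated as the view name.
        path, separator, view = raw.partition("#")
        entries.append((path, view if separator and view else None))
    return entries


# Hypothetical value: one database with a view, one without.
for path, view in parse_database_parameter("mail/alice.nsf#Inbox;apps/crm.nsf"):
    print(path, "->", view or "(default view, if the database has one)")
```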

To add filters to select which attachments to crawl, you can:

  • List which attachment document extensions are to be crawled (for example ppt, doc, xls and pdf);

  • List which attachment document extensions are to be excluded from the crawl (for example exe, jpg and zip);

  • Specify a maximum size for attachments to be crawled.

Filters are specified in the Filters parameter group of the FAST Search Lotus Notes content connector configuration file. The table below lists the parameters in this group and how they are used. For a full description and example values, refer to the technical reference.

| Filter category | Parameter | Short description |
| --- | --- | --- |
| Databases | DataBase | Semicolon-separated list of database paths to be crawled. These paths must exactly match the path (and file name) of the database on the Domino server. This parameter may contain views. To include a view, append the view to the database path, separating it with a #. If a view is specified, only the search formula of that view is used. If no view is specified, the default database view is used. If there is no default view, all Lotus Notes documents will be crawled. |
| Databases | IncludeDatabases | Semicolon-separated list of one or more include filters (regular expressions) for database names/paths. Use when the DataBase parameter is empty, and in combination with the ExcludeDatabases filter. |
| Databases | ExcludeDatabases | Semicolon-separated list of one or more exclude filters (regular expressions) for database names/paths. Use when the DataBase parameter is empty, and in combination with the IncludeDatabases filter. This parameter overrides the include filter; use it to narrow included content. |
| Lotus Notes documents | ViewName | The name of a view to be used for all databases that are crawled. If the view cannot be found in a database, all documents will be crawled. To use different views for different databases, use the DataBase parameter and list all the databases to be indexed along with the view name for each. |
| Lotus Notes documents | SearchQuery | A search formula written in the Notes Query Language that will be used for all databases that do not have the view specified in the ViewName parameter AND do not have a default view. |
| Attachments | IncludeAttachmentExtensions | Semicolon-separated list of file name extensions for attachments that should be crawled. |
| Attachments | ExcludeAttachmentExtensions | Semicolon-separated list of file name extensions for attachments that should NOT be crawled. This parameter is ignored if the IncludeAttachmentExtensions parameter is set. |
| Attachments | MaxAttachmentSize | Maximum size of attachments (in KB). Attachments larger than this size will not be crawled. Use this parameter to increase performance and/or to prevent the connector from sending documents to the search engine that are too large. |
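The three attachment parameters combine as sketched below in Python. This is a simplified model under stated assumptions, not the connector's code: the function name `should_crawl_attachment` is hypothetical, extensions are compared case-insensitively, and the extension is taken to be the text after the last dot. What comes from the parameter descriptions is that the exclude list is ignored when an include list is set, and that attachments larger than the maximum size (in KB) are skipped.

```python
def should_crawl_attachment(filename, size_kb, include_ext="", exclude_ext="", max_size_kb=None):
    """Return True if this simplified model would crawl the attachment."""
    def split(value):
        return {e.strip().lower().lstrip(".") for e in value.split(";") if e.strip()}

    included = split(include_ext)
    excluded = split(exclude_ext)
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""

    # Attachments larger than the limit are never crawled.
    if max_size_kb is not None and size_kb > max_size_kb:
        return False
    # When an include list is set, the exclude list is ignored.
    if included:
        return extension in included
    return extension not in excluded


print(should_crawl_attachment("report.PDF", 120, include_ext="ppt;doc;xls;pdf"))
print(should_crawl_attachment("setup.exe", 120, exclude_ext="exe;jpg;zip"))
print(should_crawl_attachment("big.pdf", 5001, include_ext="pdf", max_size_kb=5000))
```

Under these assumptions the first call accepts the attachment, and the other two reject it, one because of its extension and one because of its size.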

Considerations when modifying the filters

Be careful when you modify any of the filters after the initial crawl has taken place. This is because, when the connector performs an incremental crawl, it only picks up databases and Lotus Notes documents that have been modified since the previous run.

Warning

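By changing filters after the initial crawl, you can increase the number of items that the content connector crawls, but you cannot reduce it. Items that were already indexed stay in the search engine even if the new filters no longer match them.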

Scenario 1 describes what happens if you try to reduce the number of items to be crawled by modifying the filters after the initial run, which results in incorrect behavior. Scenario 2 gives an example of how a filter change can increase the number of items in the crawl, which results in correct behavior.

Scenario 1

You run the connector with a DataBase parameter that lists three databases to be crawled: Database1.nsf;Database2.nsf;Database3.nsf. Each of these databases contains 1000 Lotus Notes documents, so 3000 items are crawled in total.

Now, you remove Database3.nsf from the DataBase parameter before an incremental run of the connector takes place. The connector does not detect that the parameter has changed, and it does not know what the value of the parameter was before the change. The connector crawls Database1.nsf and Database2.nsf incrementally, detecting changes in these databases. The 1000 Lotus Notes documents from Database3.nsf remain unmodified in the search engine. Unless you re-add Database3.nsf to the configuration, they will stay there indefinitely. This is the case even if you delete the entire database from the Domino server, which leads to erroneous search results.

This applies to all of the parameters in the Filters group of the configuration file.
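The effect in Scenario 1 can be reproduced with a small Python model. It is a toy simulation under stated assumptions, not the connector's logic: the search engine index is a plain dictionary, the function name `incremental_crawl` and the change format are hypothetical, and an incremental run is modeled as applying changes only for the databases that are still configured.

```python
def incremental_crawl(index, configured, changes):
    """Apply changes only for databases that are still in the configuration.

    index:      dict mapping (database, doc_id) -> content in the search engine
    configured: database names currently listed in the DataBase parameter
    changes:    dict mapping database -> list of (doc_id, content_or_None),
                where None means the document was deleted on the Domino server
    """
    for database in configured:
        for doc_id, content in changes.get(database, []):
            if content is None:
                index.pop((database, doc_id), None)
            else:
                index[(database, doc_id)] = content
    return index


databases = ["Database1.nsf", "Database2.nsf", "Database3.nsf"]
# Initial crawl: 3 databases with 1000 documents each.
index = {(db, n): "v1" for db in databases for n in range(1000)}

# Database3.nsf is removed from the configuration, then deleted on the server.
configured = ["Database1.nsf", "Database2.nsf"]
changes = {
    "Database1.nsf": [(0, "v2")],
    "Database3.nsf": [(n, None) for n in range(1000)],
}
incremental_crawl(index, configured, changes)

stale = sum(1 for (db, _) in index if db == "Database3.nsf")
print(len(index), stale)  # prints "3000 1000"
```

The change to Database1.nsf is picked up, but the deletions in Database3.nsf are never seen because the run no longer visits that database, so its 1000 items stay in the index.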

If you need to reconfigure the connector to no longer crawl content that you previously crawled, follow this procedure:

  1. Make the necessary modifications to your FAST Search Lotus Notes content connector configuration file and set the configuration parameter StateTracker/PurgeAtStart to true.

  2. Clear the contents of the content collection that you are feeding to, as specified in the configuration parameter FASTSearchSubmit/Collection.

    Warning

    This will delete all the items in the content collection, possibly including content from other sources, like the shared SharePoint indexing connector(s), the FAST Search Web crawler and/or the FAST Search database connector.

    1. On the Start menu, click All Programs.

    2. Click Microsoft FAST Search Server 2010 for SharePoint.

    3. Click the Microsoft FAST Search Server 2010 for SharePoint shell.

    4. At the Microsoft FAST Search Server 2010 for SharePoint shell command prompt, type the following command:

      Clear-FASTSearchContentCollection -Name <collection name>

      Where:

      <collection name> is the content collection you are about to clear.

      Wait for the command to finish. This may take some time.

  3. Run the FAST Search Lotus Notes content connector to re-feed all the documents that match the new configuration. Because you set StateTracker/PurgeAtStart to true, the connector will not perform an incremental crawl, but instead purge its state and run a full crawl. This may take a long time, depending on how much content is to be crawled.

  4. When the full crawl has finished, set the parameter StateTracker/PurgeAtStart to false in order to activate incremental crawls.

Important

After clearing the content collection, all the relevant content will have been removed from the search engine. It will not be searchable until the FAST Search Lotus Notes content connector is rerun.

To reduce the impact of having to delete all content from the search engine, consider splitting the connector configuration into multiple configurations, where each connector configuration uses a separate content collection. This reduces the amount of content that is (temporarily) deleted from the search engine. When deciding how to break up the configuration, use the following guidelines:

  • If you have Lotus Notes databases that you know will never be deleted, crawl them with a separate configuration. You should never need to perform the procedure above for this configuration.

  • If some Lotus Notes databases are more important to keep searchable than others, crawl them with a separate configuration and keep the amount of content for this configuration reasonably small, so that the time during which the content is not searchable is minimized. For example, if you are crawling a thousand e-mail databases, and it is important that this content is searchable, consider crawling the mailboxes with ten different configurations (each crawling one hundred databases). That way, when a user leaves the company and that user's mailbox must be removed from the crawl, only the one hundred users in the same configuration are affected, and their search downtime is shorter.

  • Even if you conclude that a single configuration for the FAST Search Lotus Notes connector is sufficient, consider using one content collection for Lotus Notes content and a separate one for other content (for example, SharePoint sites, databases, and Web content), so that the procedure above does not delete the content from those sources.
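As an illustration of the mailbox example in the guidelines above, the following Python sketch splits a list of database paths into batches of one hundred and builds a semicolon-separated DataBase value for each batch. The function name `split_into_configurations` and the mailbox paths are hypothetical; how each value is written into a configuration file, and which content collection each configuration feeds, is left out.

```python
def split_into_configurations(databases, per_configuration):
    """Split database paths into consecutive batches, one batch per configuration."""
    if per_configuration < 1:
        raise ValueError("per_configuration must be at least 1")
    return [
        databases[start:start + per_configuration]
        for start in range(0, len(databases), per_configuration)
    ]


# Hypothetical mailbox paths: 1000 e-mail databases.
mailboxes = [f"mail/user{n:04d}.nsf" for n in range(1000)]
batches = split_into_configurations(mailboxes, 100)

# One DataBase parameter value per connector configuration.
database_values = [";".join(batch) for batch in batches]
print(len(database_values), "configurations,", len(batches[0]), "databases each")
```

With this split, clearing and re-feeding the content collection of one configuration temporarily removes only that batch of one hundred mailboxes from search.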

Scenario 2

You run the connector with a DataBase parameter that lists three databases to be crawled: Database1.nsf;Database2.nsf;Database3.nsf. Each of these databases contains 1000 Lotus Notes documents, so 3000 items are crawled in total.

Now, you add Database4.nsf to the DataBase parameter before an incremental run of the connector occurs. The connector will not detect that the parameter value has changed, but will crawl all four databases and detect any changes to databases Database1.nsf, Database2.nsf and Database3.nsf. In addition, it will crawl and index all content from Database4.nsf. This is because the connector has no state for this database from a previous run. In the next incremental run, all four databases will be crawled incrementally.
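The reason Scenario 2 works can be modeled as a per-database state lookup. The Python sketch below is a simplified illustration, not the connector's state tracker: the function name `plan_crawl`, the state dictionary, and the run markers are all hypothetical. It only captures the rule described above, that a database without state from a previous run is crawled in full.

```python
def plan_crawl(configured, state):
    """Decide, per configured database, whether the next run is incremental or full."""
    plan = {}
    for database in configured:
        # No state from a previous run means there is nothing to compare against.
        plan[database] = "incremental" if database in state else "full"
    return plan


# State left behind by the initial crawl of three databases.
state = {db: "run-1" for db in ["Database1.nsf", "Database2.nsf", "Database3.nsf"]}

# Database4.nsf is added to the DataBase parameter before the next run.
configured = ["Database1.nsf", "Database2.nsf", "Database3.nsf", "Database4.nsf"]
first_plan = plan_crawl(configured, state)
print(first_plan["Database4.nsf"])  # prints "full"

# After that run, every configured database has state.
for database in configured:
    state[database] = "run-2"
second_plan = plan_crawl(configured, state)
print(set(second_plan.values()))  # prints "{'incremental'}"
```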