The index schema and its features (FAST Search Server 2010 for SharePoint)
Published: March 31, 2011
The index schema controls which content parts are indexed, searchable, and available for viewing. The predefined index schema (delivered out of the box) has been tested and tuned to suit many common deployments. However, depending on the content that you want to index and make available for viewing in your web front-end, you should plan the overall index schema strategy before you deploy a full-scale FAST Search Server 2010 for SharePoint farm and before you start indexing large amounts of content. Otherwise, you may have to re-index all the content for the changes to take full effect. You can make incremental changes to the mapping without any service interruption or search downtime, but applying major changes after you have indexed a large amount of content is very inconvenient.
If your deployment will index many millions of documents, we recommend that you first tune the index schema and the associated end-user search features on a smaller test installation that has a relevant subset of the content that you want to index.
The index schema
The following features are handled in the index schema:
Crawled properties are metadata that is extracted from content sources when you run a crawler or connector. Metadata can be structured content, such as the title or the author from a Word document, or unstructured content that is derived from the content of a search item, such as a detected language or extracted keywords. A crawled property is uniquely defined by its Name, Propset and VariantType. Crawled properties are sorted into categories. A category is a high-level grouping based on the iFilter and Protocol Handler (given by the indexing connector and the data source) that were used to extract the metadata from the content. Examples of categories are as follows:
Business Data – Metadata associated with content in the Business Data Catalog.
Mail – Metadata associated with Microsoft Exchange Server.
Office – Metadata contained in Microsoft Office documents such as Word, Excel and PowerPoint.
People – Metadata associated with the people profiles in SharePoint. The majority of these are also mapped to various managed properties from Active Directory and SharePoint information.
Web – HTML metadata associated with web pages.
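The identity rule for crawled properties (Name, Propset and VariantType together) can be illustrated with a minimal Python sketch. This is only a model of the concept, not the product's object model: the class name, the helper function, the property set identifiers and the variant type codes below are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawledProperty:
    """Illustrative model: identity is the (name, propset, variant_type) triple."""
    name: str
    propset: str       # property set identifier, written as a string
    variant_type: int  # numeric variant type code


def group_by_category(discovered):
    """Group (category, property) pairs; equal triples collapse into one entry."""
    categories = {}
    for category, prop in discovered:
        categories.setdefault(category, set()).add(prop)
    return categories


# Hypothetical property set identifiers and variant type codes.
OFFICE_SET = "00000000-0000-0000-0000-000000000001"
WEB_SET = "00000000-0000-0000-0000-000000000002"

discovered = [
    ("Office", CrawledProperty("Title", OFFICE_SET, 31)),
    ("Office", CrawledProperty("Title", OFFICE_SET, 31)),  # rediscovered, same identity
    ("Web", CrawledProperty("Title", WEB_SET, 31)),        # same name, different propset
]
categories = group_by_category(discovered)
```

In this sketch, the two Office entries count as one crawled property, while the Web entry is a different crawled property even though it shares the name Title.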
A number of crawled properties contain metadata that is irrelevant or may adversely affect the search relevance. It generally does not make sense to add numeric values to the full-text index. You would also not include crawled properties that have sensitive content, or crawled properties that have content that contains words and phrases that occur in most of the documents in the full-text index. These should be excluded from searching. The default index schema already provides several excluded crawled properties that work well for common content formats. However, because the discovery of crawled properties is content driven, additional properties will be discovered that you must consider when you tune the index schema for relevance.
There are several reasons for excluding the content of crawled properties to improve both relevance and recall. For example, some crawled properties might already be searchable through a managed property mapping, and do not also have to be searchable through the default full-text index.
For more information about how to exclude content of crawled properties, see Exclude crawled properties from searching by using Windows PowerShell (FAST Search Server 2010 for SharePoint).
To make metadata searchable, you map one or more crawled properties to a managed property. The crawled properties will contain a large number of different metadata properties. A key phase of your deployment planning is to determine the mapping of these crawled properties to managed properties. The default index schema provides default mappings that are adapted to common content formats. As you optimize the system for relevance, look at the quality of the content in managed properties, determine whether there are other crawled properties that have better quality for your content, and if so, update the mappings. To make it easier to test your changes, perform an initial tuning of the crawled property mapping on a test installation that has a limited amount of content.
For more information about how to map crawled properties, see Map a crawled property to a managed property by using Windows PowerShell.
In the simplest form, a search index can contain the searchable representation of the body and title of a document. However, you will quickly experience the power of mapping and indexing the various metadata of your content sources. By using the FAST Search Server 2010 for SharePoint schema administration services, you can explore the actual crawled properties of the content sources and decide on a mapping to managed properties. You can then assign features to the managed properties that enhance the end-user experience at query time, such as refiners and alphabetical sorting of the search results.
For more information about how to define a refiner, see Define a refiner for a managed property by using Windows PowerShell.
For more information about how to sort search results alphabetically, see Set a managed property as sort criteria by using Windows PowerShell.
A full-text index will typically contain a set of managed properties that represents the content of the item that you are querying. This includes the body of the item, the title, the URL, and so on. The default full-text index is named content.
When a search is performed, it searches in a full-text index. Managed properties are mapped to a full-text index to enable you to search across several managed properties at the same time. The index schema manages the mapping of managed properties into full-text indexes. With this mapping, the managed properties are assigned an importance level between 1 and 7, where 7 is the most important. The importance level represents the perceived importance of the managed property within the full-text index as related to drilling. Drilling works with a StopWordThreshold to ensure that the most relevant search items are returned first when the StopWordThreshold is reached on a search against a full-text index (see later in this article for more information about the StopWordThreshold).
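For example, suppose that the StopWordThreshold is set at 2,000,000, and that the full-text index contains managed properties at different importance levels (shown in parentheses), including body (1) and Author (2), with the remaining managed properties at level 3 or higher.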
When a search is performed, the first level that is used in rank computation includes all managed properties from levels 1 through 7. So in our example, all the listed managed properties would be included. If the number of matching search items is larger than 2,000,000 (the StopWordThreshold), the system automatically retries the search by drilling to the next importance level. By drilling to importance level 2, only managed properties from levels 2 and up will be used in the rank computation. In our example, this would mean that the managed property body is not included. If, after drilling to importance level 2, the number of matching search items is still larger than 2,000,000, the system will drill to level 3. In our example, the managed properties body and Author would then not be included in rank computation. Note that all matching search items still appear in the search results list. However, search items that are not part of the rank computation appear after all search items that are included in the rank computation.
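The drilling behavior described above can be sketched in Python. This is a simplified model under stated assumptions, not the product's implementation: the function name drill, the item identifiers, the Title property at level 7 and the scaled-down threshold of 2 (instead of 2,000,000) are hypothetical. The sketch also assumes that drilling simply stops at level 7 if the threshold is still exceeded, which the text does not specify.

```python
def drill(importance, hits, stop_word_threshold, max_level=7):
    """Return (level, ranked_properties, ranked_items) after drilling.

    importance: managed property name -> importance level (1..7)
    hits:       managed property name -> set of matching item ids
    """
    for level in range(1, max_level + 1):
        # Only properties at this importance level and up take part in ranking.
        props = {name for name, lvl in importance.items() if lvl >= level}
        ranked = set().union(*(hits.get(name, set()) for name in props))
        # Stop drilling once the match count no longer exceeds the threshold.
        if len(ranked) <= stop_word_threshold:
            break
    return level, props, ranked


# body (1) and Author (2) follow the example; Title (7) is a hypothetical addition.
importance = {"body": 1, "Author": 2, "Title": 7}
hits = {"body": {1, 2, 3, 4, 5}, "Author": {1, 2, 6}, "Title": {1, 2}}

level, props, ranked = drill(importance, hits, stop_word_threshold=2)

# Every matching item is still returned; unranked items follow the ranked ones.
all_matches = set().union(*hits.values())
unranked_tail = sorted(all_matches - ranked)
```

With these inputs the sketch drills past levels 1 and 2 and stops at level 3, where only Title contributes to rank computation. Items that match only in body or Author remain in the result set but are placed after the ranked items.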
You can tune and improve relevance by separating managed properties that map to the same full-text index into different importance levels. For more information about how to do this task, see Map a managed property to a full-text index at a specific importance level by using Windows PowerShell (FAST Search Server 2010 for SharePoint).
In certain cases you may want to define multiple full-text indexes for different kinds of queries or for different applications. Although this gives a large amount of flexibility, it will have a certain performance cost for disk space and use of system resources such as file descriptors. We therefore recommend that you do not define more than 10 full-text indexes inside an index schema.
The index schema discovers new crawled properties and automatically makes their content available for searching. The contents of new crawled properties are mapped into the default full-text index at the lowest importance level (1). Therefore, you do not need prior knowledge about the crawled properties, and you do not have to create mappings to managed properties, to make the contents of newly discovered crawled properties searchable.
A rank profile is a sort criterion that is used by FAST Search Server 2010 for SharePoint. Rank configuration is stored in a rank profile. Compared to sort criteria like name or date, the rank profile is a more complex structure that is used to calculate the rank score. The rank profile is configured to calculate the rank score with input from quality rank components (also known as static rank components) and dynamic rank components. You can create new rank profiles or modify the default rank profile to optimize search results based on your kind of content. You can also set the different rank profiles as sort criteria on the search results page. This gives the user more options for filtering the search result. The default rank profile is named default. It is also used as the default sort criterion (Sort by Relevance) on the search results page. The final rank score calculation is complex, but the ability to configure rank profiles gives you a way to control the rank factors that you consider important for the system. Customizing the rank profiles and creating new rank profiles has only a small effect on static system resources like disk and memory.
For more information about the rank profile, see Manage rank profiles (FAST Search Server 2010 for SharePoint).
Rank profile changes will mainly affect query performance as outlined in the following list.
Stop Word Threshold. This is an important parameter that prevents queries for very common words from consuming too many resources during evaluation. The number of matching search items used in the rank computation is determined by the StopWordThreshold attribute in the rank profile. If the number of matching items for a search is larger than the StopWordThreshold for the rank profile being used, the FAST Search Server 2010 for SharePoint system automatically retries the search by limiting the managed properties searched. This process is known as drilling.
The value of the StopWordThreshold is relative to a full-text index size of 10,000,000 items. In the default rank profile the StopWordThreshold value is 2,000,000, which means that drilling is performed when a search word is found in more than 20% of items in the full-text index (2,000,000 divided by 10,000,000).
For more information about how to change the Stop Word Threshold, see Change the Stop Word Threshold by using Windows PowerShell (FAST Search Server 2010 for SharePoint).
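The arithmetic behind the StopWordThreshold can be written out as a short Python sketch. The function names are hypothetical, and the sketch assumes that the threshold scales linearly with the actual index size, which is how the description "relative to a full-text index size of 10,000,000 items" is read here.

```python
REFERENCE_INDEX_SIZE = 10_000_000  # index size the threshold is expressed against


def drilling_fraction(stop_word_threshold):
    """Fraction of indexed items a search word must exceed to trigger drilling."""
    return stop_word_threshold / REFERENCE_INDEX_SIZE


def effective_threshold(stop_word_threshold, index_size):
    """Match count that triggers drilling for an index of the given size (assumed linear)."""
    return stop_word_threshold * index_size // REFERENCE_INDEX_SIZE


default_fraction = drilling_fraction(2_000_000)
small_index_limit = effective_threshold(2_000_000, 1_000_000)
large_index_limit = effective_threshold(2_000_000, 50_000_000)
```

Under this reading, the default value of 2,000,000 corresponds to 20% of the index: about 200,000 matching items in an index of 1,000,000 items, and 10,000,000 matching items in an index of 50,000,000 items.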
Managed property field boost. This is an efficient way to achieve a targeted relevance boost for documents whose managed properties have certain values. For example, if you want to give additional rank points to indexed items of a specific type, such as Word documents, you would give search items with the file extension .docx additional rank points. Each managed property field boost setting adds to the evaluation time for all queries. Therefore, be careful not to define too many such boosts within the same rank profile. It is better to define multiple rank profiles, each with targeted managed property boost settings.
For more information about how to change the Managed property field boost, see Change the managed property field boost weight by using Windows PowerShell (FAST Search Server 2010 for SharePoint).
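The effect of a managed property field boost can be modeled with a small Python sketch. It is a conceptual illustration only and does not reflect how the rank score is actually computed: the function name, the property name fileextension, the base scores and the boost of 500 rank points are hypothetical.

```python
def boosted_score(base_score, item, boosts):
    """Add rank points for every boost whose managed property value matches the item."""
    extra = sum(
        points
        for (prop, value), points in boosts.items()
        if item.get(prop) == value
    )
    return base_score + extra


# Hypothetical boost: items with file extension docx get 500 extra rank points.
boosts = {("fileextension", "docx"): 500}

items = [
    {"title": "Budget.pdf", "fileextension": "pdf", "base_score": 1000},
    {"title": "Plan.docx", "fileextension": "docx", "base_score": 800},
]

results = sorted(
    items,
    key=lambda item: boosted_score(item["base_score"], item, boosts),
    reverse=True,
)
```

In this model every configured boost is checked against every scored item, which mirrors the caution above: each additional boost in the same rank profile adds to the evaluation work for all queries.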