About linguistic features (FAST Search Server 2010 for SharePoint)

 

Applies to: FAST Search Server 2010

Microsoft FAST Search Server 2010 for SharePoint has many linguistic features that help improve search relevancy. Some features can be tuned, but other features have a default behavior that cannot be changed.

The following linguistic features are described in this article:

  • Tokenization

  • Automatic language and encoding detection

  • Stemming

  • Spell checking and spell tuning

  • Anti-phrasing

  • Property extraction

  • Offensive content filtering

For an overview of the supported languages for these linguistic features, see Linguistic features per language (FAST Search Server 2010 for SharePoint).

Tokenization

Tokenization is the segmentation of text into individual words (tokens) that can be indexed. Spaces, tabs, periods, commas, dashes, question marks, and quotation marks are treated as delimiting characters. East Asian languages such as Chinese and Japanese do not place these delimiting characters between words, so more sophisticated methods must be used to produce the indexable tokens.

Tokenization is performed on text content both during item processing and query processing. The tokenization process in FAST Search Server 2010 for SharePoint consists of three stages:

  1. Language-independent input normalization, where input text is transformed into a unified format. This includes replacing complex characters such as ligatures with their canonical forms, and replacing less-used Unicode characters with compatible characters or sequences of characters (for example, the trademark sign with tm, or full-width Latin characters with their half-width equivalents).

  2. A language-specific tokenization engine, selected according to the document language, uses word breakers to split the text into individual words (tokens).

  3. Indexed tokens are normalized according to language-independent rules to ensure cross-language retrieval. The normalization reduces the complexity of a character by changing it or removing parts of it. In FAST Search Server 2010 for SharePoint, all characters are lowercased and accented characters are reduced to their unaccented base characters.
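The three stages above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the product's implementation: it assumes that Unicode NFKC normalization is a reasonable stand-in for the input normalization stage, and it uses a naive delimiter-based splitter in place of a language-specific word breaker. All function names are hypothetical.

```python
import re
import unicodedata

# Delimiters named in this article: spaces, tabs, periods, commas, dashes,
# question marks and quotation marks (straight and curly).
DELIMITERS = re.compile(r'[\s.,\-?"\u201c\u201d]+')


def normalize_input(text: str) -> str:
    """Stage 1 (assumed): NFKC maps ligatures, the trademark sign and
    full-width Latin characters to compatible plain characters."""
    return unicodedata.normalize("NFKC", text)


def split_tokens(text: str) -> list[str]:
    """Stage 2 (simplified): split on delimiters. A real engine would pick
    a language-specific word breaker here."""
    return [token for token in DELIMITERS.split(text) if token]


def normalize_token(token: str) -> str:
    """Stage 3: lowercase, then reduce accented characters to their
    unaccented base characters by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Run all three stages in order."""
    return [normalize_token(t) for t in split_tokens(normalize_input(text))]
```

Because the same function would be applied to both item text and query text, a query for resume can match an item that contains Résumé. The delimiter-based splitter does not work for languages such as Chinese or Japanese.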

Tokenization is supported for all languages.

Automatic language and encoding detection

During item processing, FAST Search Server 2010 for SharePoint automatically recognizes more than 80 different languages in all common encodings. The language and encoding of the text can be defined in the metadata of a document, or they can be determined by an automatic process during item processing.

This information is used to select the appropriate language-specific dictionaries and algorithms during item processing.
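The following Python sketch illustrates the choice between a language declared in metadata and a language determined automatically. It is a toy example under stated assumptions: the stop-word overlap heuristic, the three-language word lists, and the names `detect_language` and `resolve_language` are all hypothetical, and the sketch does not cover encoding detection. The detection method that the product uses is not described in this article.

```python
# Hypothetical, tiny stop-word lists used only for this illustration.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to"},
    "de": {"der", "und", "die", "ist", "das"},
    "fr": {"le", "et", "les", "est", "des"},
}


def detect_language(text: str):
    """Guess the language by counting stop-word hits; None if nothing matches."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def resolve_language(metadata: dict, text: str):
    """Prefer the language declared in metadata; otherwise detect it."""
    return metadata.get("language") or detect_language(text)
```

The resolved language code would then be the key for selecting language-specific dictionaries and algorithms.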

Stemming

Stemming merges multiple forms of the same word, for example the singular and plural forms of a noun, so that a query for one form also matches items that contain the other forms. This increases recall, and it is especially important for languages that have many forms of the same word. You cannot tune the stemming dictionaries.
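The effect of stemming on recall can be illustrated with a minimal Python sketch. It assumes a small, hypothetical lookup table that maps inflected forms to a base form; the contents and format of the product's stemming dictionaries are not described in this article.

```python
from collections import defaultdict

# Hypothetical stemming dictionary: inflected form -> base form.
STEMS = {"cars": "car", "mice": "mouse", "running": "run", "ran": "run"}


def stem(word: str) -> str:
    """Return the base form if one is known, else the lowercased word."""
    lowered = word.lower()
    return STEMS.get(lowered, lowered)


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Index every document under the stems of its words."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[stem(word)].add(doc_id)
    return index


def search(index: dict[str, set[int]], term: str) -> set[int]:
    """Stem the query term the same way, so all forms meet at one key."""
    return index.get(stem(term), set())
```

With the documents "red car", "two cars" and "mice ran", a query for car returns the first two documents, and a query for running returns the third, even though neither query term appears literally in every matching document.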

Spell checking and spell tuning

Spell checking improves the quality of queries by comparing the query terms against language-specific dictionaries and identifying misspelled terms.

Spell tuning aligns the spell checking dictionaries with the frequency of words in the processed documents, so that users only get spell checking suggestions that are relevant to the processed content. Without this alignment, a suggestion could lead to a zero-hit result set.

You can define words to exclude from spell checking, for example a specific product or company name. The exclusion list is used for all languages.
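The interplay between spell checking, spell tuning, and the exclusion list can be sketched as follows. This is an illustration only: it assumes that tuning simply drops dictionary words that do not occur in the processed content, and it uses `difflib.get_close_matches` from the Python standard library as a stand-in for the product's matching logic. The function names and the 0.8 similarity cutoff are hypothetical.

```python
import difflib
from collections import Counter


def tune_dictionary(dictionary: set[str], content_tokens: list[str],
                    min_count: int = 1) -> set[str]:
    """Keep only dictionary words that occur in the processed content."""
    counts = Counter(content_tokens)
    return {word for word in dictionary if counts[word] >= min_count}


def suggest(term: str, dictionary: set[str], exclusions: set[str]):
    """Return a correction for a misspelled term, or None.

    Terms on the exclusion list and terms already in the dictionary
    are never corrected.
    """
    term = term.lower()
    if term in exclusions or term in dictionary:
        return None
    matches = difflib.get_close_matches(term, sorted(dictionary), n=1, cutoff=0.8)
    return matches[0] if matches else None
```

If the untuned dictionary contains starch but no processed item does, the tuned dictionary no longer offers starch as a suggestion, which avoids sending the user to an empty result set. A product name placed on the exclusion list is left as typed.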

Anti-phrasing

Anti-phrasing is closely related to the concept of stop words, which are words that the search system ignores in end-user queries. Unlike stop-word removal, the anti-phrasing feature removes complete phrases rather than single words. Removing single words carries the risk of removing important words that happen to be identical to stop words. Phrases are less ambiguous and can be removed from the query more safely. For this reason, the anti-phrasing dictionaries that are delivered with FAST Search Server 2010 for SharePoint do not contain single words. You cannot tune the anti-phrasing dictionaries.
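A minimal Python sketch of phrase-level removal is shown below. The phrase list and the function name are hypothetical; the actual dictionary entries are not listed in this article.

```python
import re

# Hypothetical anti-phrase dictionary. Every entry is a complete phrase;
# there are no single-word entries.
ANTI_PHRASES = ["where can i find", "how do i", "what is"]


def strip_anti_phrases(query: str, phrases=ANTI_PHRASES) -> str:
    """Remove whole anti-phrases from a query, ignoring case."""
    result = query
    # Longest phrases first, so a short entry cannot clip a longer one.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        result = re.sub(pattern, " ", result, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by removed phrases.
    return " ".join(result.split())
```

Because only whole phrases are matched, a single word that also appears inside an anti-phrase is left in place when it occurs on its own. For example, in the query "what is what", the phrase is removed and the remaining word what is kept as a search term.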

Property extraction

FAST Search Server 2010 for SharePoint provides advanced, language-specific property extractors for person names, company names and geographic names/locations.

For more information, see Manage property extraction (FAST Search Server 2010 for SharePoint).

Offensive content filtering

FAST Search Server 2010 for SharePoint can provide filtering against offensive content for many languages.

Offensive content filtering is not provided out of the box, but can be configured.

See Also

Concepts

Linguistic features per language (FAST Search Server 2010 for SharePoint)
Spell tuning cmdlets (FAST Search Server 2010 for SharePoint)