Linguistic features (FAST Search Server 2010 for SharePoint)

Article
07/22/2014

Applies to: FAST Search Server 2010

FAST Search Server 2010 for SharePoint has many linguistic features that help improve search relevancy. Some features can be tuned, and other features have a default behavior that cannot be changed. The following table describes the different linguistic features and their effect on relevance and recall.

Linguistic feature	Impact on relevance	Impact on recall	Description
Synonyms	No	Yes	Synonyms are a list of words attached to a keyword. Keywords are words or phrases that you have identified as typical terms within your organization. You attach synonyms to keywords to increase recall. When a search includes a synonym for a keyword, search items that contain the related keyword are also returned. Furthermore, if a search includes a keyword, search items that contain the synonym are also returned, regardless of whether they contain the keyword. Please be aware that this only applies when there is a complete match between the search words and any defined keywords or synonymous terms.
Stemming	Yes	Yes	Words can have multiple forms but basically mean the same thing. For example, the verb “to write” includes forms such as writing, wrote and writes. Similarly, nouns usually include singular and plural versions, such as book and books. The stemming feature in FAST Search Server 2010 for SharePoint can increase recall of relevant documents by mapping one form of a word to its variants. For languages that have many forms of the same word, stemming is very important to achieve sufficient recall. Stemming is applied to the content of managed properties where stemming is enabled. You cannot tune the stemming dictionaries.
Spell checking and spell tuning	Yes	No	The spell checking feature improves the quality of searches by comparing the search words against language-specific dictionaries and identifying misspelled terms. If the dictionary contains a closely matching word with a significantly higher frequency, that word is suggested through the Did you mean? feature. You can fine-tune the spell checking dictionaries to make sure that they are aligned with the frequency of words in the processed documents. Users will only get spell checking suggestions that are relevant within the processed content. You can also define spell checking exceptions. These are words that are not found in the default spell checking dictionary but that are still valid words. When a user types a search word that is included in the spell checking exception list, the Did you mean? feature will not suggest a correction for that word. The spell checking dictionary contributes to both increased recall and relevance because the feature prevents usage of misspelled words.
Anti-phrasing	Yes	No	Anti-phrasing refers to phrases for which there is no value in indexing. “Where can I find information about” is a typical anti-phrase for English. You cannot tune the anti-phrasing dictionaries.
Tokenization	Yes	Yes	The tokenization process splits a stream of text into individual words (tokens) that can be indexed. Spaces, tabs, periods, commas, dashes, question marks and quotation marks are considered delimiting characters. For East Asian languages, that is Chinese (Simplified and Traditional), Japanese and Thai, where spaces are not consistently used to separate words, tokenization is especially important for relevancy. Tokenization is performed on text content both during item processing and search processing. The tokenization process in FAST Search Server 2010 for SharePoint consists of three stages: (1) Language-independent input normalization, where input text is transformed into a unified format. This includes replacing complex characters such as ligatures with their canonical forms, and replacing less-used Unicode characters with compatible characters or sequences of characters (for example, the trademark sign with tm, or full-width Latin characters with half-width equivalents). (2) A language-specific tokenization engine, selected according to the document language, splits the text into individual words (tokens) by using word breakers. (3) Indexed tokens are normalized according to language-independent rules to ensure cross-language retrieval. The normalization reduces the complexity of a character by changing it or removing parts of it. In FAST Search Server 2010 for SharePoint, all characters are lowercased and accented characters are reduced to their unaccented base characters.
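The three tokenization stages described in the table can be approximated with a short Python sketch. This is an illustration only, not the product's implementation: the function names are hypothetical, Unicode NFKC normalization stands in for the input normalization stage, and a simple split on the delimiting characters listed in the table stands in for the language-specific word breakers.

```python
import re
import unicodedata

# Delimiting characters named in the table: spaces, tabs, periods, commas,
# dashes, question marks and quotation marks (an approximation).
DELIMITERS = re.compile(r'[ \t.,\-?"]+')


def normalize_input(text: str) -> str:
    """Stage 1 (assumed): fold ligatures, full-width forms, and signs such as
    the trademark sign into compatible characters by using NFKC."""
    return unicodedata.normalize("NFKC", text)


def split_tokens(text: str) -> list[str]:
    """Stage 2 (stand-in): split on delimiters instead of a real word breaker."""
    return [token for token in DELIMITERS.split(text) if token]


def normalize_token(token: str) -> str:
    """Stage 3 (assumed): lowercase, then drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Run the three stages in order."""
    return [normalize_token(t) for t in split_tokens(normalize_input(text))]


if __name__ == "__main__":
    print(tokenize("Caf\u00e9 \ufb01nder\u2122, \uff26\uff21\uff33\uff34-search?"))
```

Because the same function would be applied during both item processing and search processing, a query for "cafe" and a document containing "Café" end up with the same indexed token.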

Automatic language and encoding detection

During item processing, FAST Search Server 2010 for SharePoint automatically recognizes more than 80 different languages in all common encodings. The text language and encoding can be defined in the metadata of a document, or it can be determined by an automatic process during item processing. The information is used to select the appropriate language-specific dictionaries and algorithms during item processing.

Property extraction

FAST Search Server 2010 for SharePoint provides advanced, language-specific property extractors for person names, company names and geographic names/locations.

For more information, see Manage property extraction (FAST Search Server 2010 for SharePoint).

Offensive content filtering

FAST Search Server 2010 for SharePoint can provide filtering against offensive content for many languages. Offensive content filtering is not provided out of the box, but can be configured.

Optimizing relevancy for East Asian languages

Because there is no fixed standard in different languages for what forms a separate token, different speakers have different ideas of what a token may be. For example, some users may consider 富士山 (Mount Fuji) to be one token and other users may regard it as two tokens, 富士 (Fuji) and 山 (Mount).

Inconsistencies between what users consider to be one token and what the tokenizer module actually identifies as one token can cause lower precision or lower recall. For example:

The Simplified Chinese tokenizer module splits the name 萨斯喀彻温 (Saskatchewan) into these tokens: 萨 (Sa), 斯 (s), 喀 (ka), 彻 (tche), 温 (wan). A search for the name "Saska", 萨 (Sa), 斯 (s), 喀 (ka), will also retrieve a document that contains "Saskatchewan", which means that the precision is lower than may be expected.

The Japanese tokenizer module marks "サスカチュワンサスカトゥーン" (Saskatchewan Saskatoon) as one token. A search for "Saskatchewan" will not retrieve a document that contains "Saskatchewan Saskatoon". Therefore, recall will be decreased.
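The two effects can be reproduced with a small Python sketch. The token lists are hard-coded from the examples above rather than produced by a real tokenizer, and `contains_phrase` is a hypothetical helper that models matching as a search for a consecutive run of tokens.

```python
def contains_phrase(query_tokens: list[str], doc_tokens: list[str]) -> bool:
    """Return True if the query tokens occur consecutively in the document tokens."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


# Over-splitting: the document was split into single characters, so a query
# for the shorter name "Saska" also matches "Saskatchewan" (lower precision).
chinese_doc = ["萨", "斯", "喀", "彻", "温"]
chinese_query = ["萨", "斯", "喀"]

# Under-splitting: two names were kept as one token, so a query for the
# first name alone finds nothing (lower recall).
japanese_doc = ["サスカチュワンサスカトゥーン"]
japanese_query = ["サスカチュワン"]

if __name__ == "__main__":
    print(contains_phrase(chinese_query, chinese_doc))
    print(contains_phrase(japanese_query, japanese_doc))
```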

FAST Search Server 2010 for SharePoint performs language-specific tokenization based on automatic language detection for the indexed items and the end-user’s locale setting. However, you can influence the default tokenization using two methods: linguistic tokenization and substring tokenization.

Linguistic tokenization means that a string of text is split into individual tokens based on language-specific rules. For East Asian languages, you can influence tokenization by creating custom dictionaries. If words are missing from the system dictionary provided by FAST Search Server 2010 for SharePoint, for example technical terms, person names or company names, or if the default tokenization is incorrect, you can add words to the custom dictionary to ensure that they are tokenized as required.
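To show why adding a word to a custom dictionary changes the token boundaries, here is a generic greedy longest-match tokenizer in Python. It is not the tokenization algorithm or dictionary format used by FAST Search Server 2010 for SharePoint, which this article does not describe. The function name and the dictionary contents are hypothetical.

```python
def tokenize_longest_match(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical dictionaries for illustration only.
system_dictionary = {"富士", "山"}
custom_dictionary = {"富士山"}

if __name__ == "__main__":
    print(tokenize_longest_match("富士山", system_dictionary))
    print(tokenize_longest_match("富士山", system_dictionary | custom_dictionary))
```

With only the system entries the text is split into 富士 and 山. Once the custom entry is added, the longer match wins and 富士山 is kept as one token.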

Substring tokenization, also known as N-gram tokenization, is typically applied to managed properties that are considered difficult to tokenize automatically. Substring tokenization removes all spaces from the text and then splits it into bigrams (overlapping two-character tokens). For example, "アメリカ" (America) is split up into: ア, アメ, メリ, リカ (a, ame, meri, rika). Without substring tokenization enabled, a CJK query may, in certain cases, be tokenized incorrectly and therefore return a meager or empty result list. This will not occur if substring tokenization is used, because all N-gram substrings of each token are indexed, as well as N-grams that span token boundaries.
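A minimal Python sketch of bigram tokenization follows. It assumes that whitespace is removed before splitting and that matching requires every query bigram to appear in the indexed set. The function names are hypothetical, and the sketch does not reproduce the single leading character (ア) listed in the example above.

```python
def bigrams(text: str) -> list[str]:
    """Remove all whitespace, then return overlapping two-character tokens."""
    compact = "".join(text.split())
    if len(compact) < 2:
        return [compact] if compact else []
    return [compact[i:i + 2] for i in range(len(compact) - 1)]


def substring_match(query: str, document: str) -> bool:
    """Assumed matching rule: all query bigrams must occur among the document bigrams."""
    indexed = set(bigrams(document))
    return all(gram in indexed for gram in bigrams(query))


if __name__ == "__main__":
    print(bigrams("アメリカ"))
    print(substring_match("メリカ", "アメリカ"))
```

Because spaces are removed first, bigrams also span the original word boundaries, which is what raises recall and index size at the same time.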

Substring tokenization is especially useful for applications where recall (the overall number of documents retrieved) is considered much more important than precision (high relevancy of the results). By using this feature, you will improve recall, but you may also reduce precision and return more items than desired. Note that substring tokenization has a significant effect on the size of the index for these managed properties. The feature is therefore not recommended for free text, but it may be considered for metadata that contains domain-specific product names, codes, and so on. To minimize the drop in precision, you can use a combination of substring tokenization and linguistic tokenization.
