Best Practices for Choosing a Language When Creating a Full-Text Index

When creating a full-text index, you need to specify a column-level language for the indexed column. The word breaker and stemmers of the specified language will be used by full-text queries on the column. There are a couple of things to consider when choosing the column language when creating a full-text index. These considerations relate to how your text is tokenized and then indexed by Full-Text Engine.

Note

To specify a column-level language for a column of full-text index, use the LANGUAGE language_term clause when specifying the column. For more information, see CREATE FULLTEXT INDEX (Transact-SQL) and ALTER FULLTEXT INDEX (Transact-SQL).

Word Breakers

A word breaker tokenizes the text being indexed on word boundaries, which are language specific. Therefore, word-breaking behavior differs among different languages. If you use one language, x, to index a number of languages {x, y, and ,z}, some of the behavior might cause unexpected results. For example, a dash (-) or a comma (,) might be a word-break element that will be thrown away in one language but not in another. Also rarely unexpected stemming behavior might occur because a given word might stem differently in different language. For example, in the English language, word boundaries are typically white space or some form of punctuation. In other languages, such as German, words or characters may be combined together. Therefore, the column-level language that you choose should represent the language that you expect will be stored in rows of that column.

Western Languages

For the Western family of languages, if you are unsure which languages will be stored in a column or you expect more than one to be stored, a general workaround is to use the word breaker for the most complex language that might be stored in the column. For instance, you might expect to store English, Spanish and German content in a single column. These three Western languages possess very similar word-breaking patterns, with the German patterns being the most complex. Therefore, a good choice is this case would be to use the German word breaker, which should be able to process English and Spanish text correctly. In contrast, the English word breaker might not process German text perfectly because of the compound words of German.

Note that using the word breaker of the most complex language in a language family does not guarantee perfect indexing of every language in the family. Corner cases might exist in which the most complex word breaker cannot correctly handle text written in another language.

Non Western Languages

For non Western languages (such as Chinese, Japanese, Hindi, and so forth) the above workaround does not necessarily work, for linguistic reasons. For non Western languages, consider one of the following workarounds:

  • For languages from different families

    If a column might contain dramatically different languages, for example, Spanish and Japanese, consider storing the content of different languages in separate columns. This would allow you to use the language-specific word breaker for each column. If you choose this solution and you don't know the query language at query time, you might need to issue the query against both columns to ensure that the query finds the right row or document.

  • For Binary content (such as Microsoft Word documents)

    When the indexed content is of binary type, the full-text search filter that processes the textual content before sending it to the word breaker might honor specific language tags existing within the binary file. In this case, at indexing time, the filter will emit the right LCID for a document or section of a document. The Full-Text Engine will then call the word breaker for the language with that LCID. However, after indexing multi language content, we recommend that you verify that the content was correctly indexed.

  • For plain text content

    When your content is plain text, you can convert it to the xml data type and add language tags that indicate the language corresponding to each specific document or document section. For this to work, however, you need to know the language before full-text indexing.

Stemming

An additional consideration when choosing your column-level language is stemming. Stemming in full-text queries is the process of searching for all stemmed (inflectional) forms of a word in a particular language. When you use a generic word breaker to process several languages, the stemming process works only for the language specified for the column, not for other languages in the column. For example, German stemmers do not work for English or Spanish (and so forth). This might affect your recall depending of which language you choose at query time.

Another consideration in language choice is related to how the data is represented. For data that is not stored in varbinary(max) column, no special filtering is performed. Rather, the text is generally passed through the word breaking component as-is.

Also, word breakers are designed mainly to process written text. So, if you have any type of markup (such as HTML) on your text, you may not get great linguistic accuracy during indexing and search. In that case, you have two choices—the preferred method is simply to store the text data in varbinary(max) column, and to indicate its document type so it may be filtered. If this is not an option, you may consider using the neutral word breaker and, if possible, adding markup data (such as 'br' in HTML) to your noise word lists.

Note

Language based stemming does not come into play when you specify the neutral language.

Specifying a Nondefault Column-Level Language in a Full-text Query

By default, in SQL Server 2008, full-text search will parse the query terms using the language specified for each column that is included in the full-text clause. To override this behavior, specify a nondefault language at query time. For supported languages whose resources are installed, the LANGUAGE language_term clause of a CONTAINS, CONTAINSTABLE, FREETEXT, or FREETEXTTABLE query can be used to specify the language used for word breaking, stemming, thesaurus, and stopword processing of the query terms.

Language Support

This section provides an introduction to word breakers and stemmers, and discusses how full-text search uses the LCID of the column-level language.

Introduction to Word Breakers and Stemmers

SQL Server 2008 includes a complete new family of word breakers and stemmers that are significantly better than the those previously available in SQL Server.

Note

The Microsoft Natural Language Group (MS NLG) implemented and supports these new linguistic components.

The new word breakers provide the following benefits:

  • Robustness

    Testing has shown that the new word breakers are robust in high-pressure query environments.

  • Security

    The new word breakers are enabled by default in SQL Server 2008 thanks to security improvements in linguistic components. We highly recommend that external components such as word breakers and filters be signed to improve the overall security and robustness of SQL Server. You can configure full-text to verify that these components are signed as follows:

    EXEC sp_fulltext_service 'verify_signature';
    
  • Quality

    Word breakers have been redesigned, and testing has shown that the new word breakers provide better semantic quality than previous word breakers. This increases the recall accuracy for SQL Server 2008 users.

  • Coverage

    for a vast list of languages, word breakers are included in SQL Server 2008 out of the box and enabled by default .

For a list of the languages for which SQL Server 2008 includes a word breaker and stemmers, see sys.fulltext_languages (Transact-SQL)

Third Party Word Breakers

Several external, third party word breakers are available for SQL Server 2008. You can manually register these word breakers. For more information, see How to: Load Licensed Third-Party Word Breakers.

After you load external (third party) word breakers, you need to refresh the list of full-text supported languages by executing the sp_fulltext_service system stored procedure specifying the update_languages option, as follows:

EXEC sp_fulltext_service 'update_languages';

Then, the languages of the newly loaded word breakers will be listed by the sys.fulltext_languages catalog view.

How Full-Text Search Uses the Name of the Column-Level Language

When creating a full-text index, you need to specify a valid language name for each column. If a language name is valid but not returned by the sys.fulltext_languages (Transact-SQL) catalog view, full-text search falls back to the closest available language name of the same language family, if any. Otherwise, full-text search falls back to the Neutral word breaker. This fall-back behavior might affect the recall accuracy. Therefore we strongly recommend that you specify a valid and available language name for each column when creating a full-text index.

Note

The LCID is used against all data types eligible for full-text indexing (such as char or nchar). If you have the sort order of a char, varchar, or text type column set to a language setting different from the language identified by the LCID, the LCID is used anyway during full-text indexing and querying of those columns.