International Considerations for Full-Text Search

Choosing a Language When Creating a Full-Text Index

There are a couple of things to consider when choosing the column language when creating a full-text index. These considerations relate to how your text is tokenized and then indexed by Microsoft full-text engine for SQL Server (MSFTESQL). A word breaker tokenizes the text being indexed on word boundaries. These word boundaries, in the English language, are typically white space or some form of punctuation. In other languages, such as German, words or characters may be combined together; therefore, your choice of a column-level language should represent the language you expect will be stored in rows of that column. If you are unsure, a general best bet is to use the neutral word breaker, which performs its tokenization purely on white space and punctuation. An additional benefit of your column-level language choice is "stemming". Stemming in full-text queries is defined as the process of searching for all stemmed (inflectional) forms of a word in a particular language.

Another consideration in language choice is related to the way in which the data is represented. For data not stored in varbinary(max) column, no special filtering is performed. Rather, the text is generally passed through the word breaking component as-is. Word breakers are designed mainly to process written text. So, if you have any type of markup (such as HTML) on your text, you may not get great linguistic accuracy during indexing and search. In that case, you have two choices—the preferred method is simply to store the text data in varbinary(max) column, and to indicate its document type so it may be filtered. If this is not an option, you may consider using the neutral wordbreaker and, if possible, adding markup data (such as 'br' in HTML) to your noise word lists.

Note

Language based stemming does not come into play when you specify the neutral language.

Language Support

In Microsoft SQL Server 2005, full-text queries can use languages other than the default language for the column to search full-text data. As long as the language is supported and its resources are installed, the language specified in the LANGUAGE language_term clause of the CONTAINS, CONTAINSTABLE, FREETEXT, and FREETEXTTABLE query will be used for word breaking, stemming, and thesaurus and noise-word processing.

The following table shows the language in which the full-text index data is stored. The language is based on the Unicode collation locale identifier selected during Microsoft SQL Server Setup.

Unicode collation locale identifier Language for full-text data storage

Chinese Bopomofo (Taiwan)

Traditional Chinese

Chinese Punctuation

Simplified Chinese

Chinese Stroke Count

Simplified Chinese

Chinese Stroke Count (Taiwan)

Traditional Chinese

Dutch

Dutch

English UK

English UK

French

French

General Unicode

English US

German

German

German Phonebook

German

Italian

Italian

Japanese

Japanese

Japanese Unicode

Japanese

Korean

Korean

Korean Unicode

Korean

Spanish (Spain)

Spanish

Swedish/Finnish

Swedish

All other Unicode collation locale identifier values that are not in this list get mapped to the neutral language word-breaker and -stemmer, which uses white spaces to delimit words.

Note

The Unicode collation locale identifier setting is used against all data types eligible for full-text indexing (such as char, nchar, and so on). If you have the sort order of a char, varchar, or text type column set to a language setting different from the Unicode collation locale identifier language, the Unicode collation locale identifier is still used during full-text indexing and querying of the char, varchar, and text type columns.

See Also

Other Resources

Full-Text Search Concepts
Data Types (Transact-SQL)
CONTAINS (Transact-SQL)
CONTAINSTABLE (Transact-SQL)
FREETEXT (Transact-SQL)
FREETEXTTABLE (Transact-SQL)

Help and Information

Getting SQL Server 2005 Assistance