10/04/2012

Word Breakers and Stemmers

Word breakers and stemmers perform linguistic analysis on all full-text indexed data. Linguistic analysis involves finding word boundaries (word breaking) and conjugating verbs (stemming). Word breakers and stemmers are language specific, so the rules for linguistic analysis differ from one language to another. For a given language, a word breaker identifies individual words by determining where word boundaries exist, based on the lexical rules of that language. Each word (also known as a token) is inserted into the full-text index in a compressed representation to reduce its size. The stemmer generates inflectional forms of a particular word based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").
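
As a minimal sketch of how stemming can be observed, the sys.dm_fts_parser dynamic management function returns the terms that the Full-Text Engine generates for a FORMSOF(INFLECTIONAL, ...) expression. The word "run" and LCID 1033 (English) are illustrative choices, and the forms actually returned depend on the stemmer installed for that language.

```sql
-- Ask the English stemmer (LCID 1033) for the inflectional forms of "run".
-- Arguments: query string, LCID, stoplist_id (NULL = no stoplist),
-- accent_sensitivity (0 = accent insensitive).
SELECT display_term,     -- the generated form
       expansion_type,   -- indicates how the term was produced (for example, by inflectional expansion)
       source_term       -- the term from which the form was generated
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, run)', 1033, NULL, 0);
```

Running sys.dm_fts_parser requires membership in the sysadmin fixed server role.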

Using language-specific word breakers enables the resulting terms to be more accurate for that language. Where there is a word breaker for the language family, but not for the specific sub-language, the major language is used. For example, the French word breaker is used to handle text that is French Canadian. If no word breaker is available for a particular language, the neutral word breaker is used. With the neutral word breaker, words are broken at neutral characters such as spaces and punctuation marks.
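
The difference between a language-specific word breaker and the neutral word breaker can be explored by tokenizing the same text with both. The following sketch assumes that the English word breaker (LCID 1033) is registered. The sample phrase is arbitrary, and no particular output is implied.

```sql
-- Tokenize one phrase with the neutral word breaker (LCID 0)
-- and with the English word breaker (LCID 1033), then compare the rows.
SELECT 0 AS lcid, occurrence, display_term, special_term
FROM sys.dm_fts_parser(N'"state-of-the-art full-text search"', 0, NULL, 0)
UNION ALL
SELECT 1033 AS lcid, occurrence, display_term, special_term
FROM sys.dm_fts_parser(N'"state-of-the-art full-text search"', 1033, NULL, 0)
ORDER BY lcid, occurrence;
```

Replacing 1033 with another registered LCID, such as 1036 for French, shows how that language's word breaker treats the same input.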

Word Breaker Registration

For the word breakers of a language to be used, they must be registered. For registered word breakers, associated linguistic resources—stemmers, noise words (stopwords), and thesaurus files—also become available to full-text indexing and querying operations. To view a list of the languages whose word breakers are currently registered with SQL Server, use the following Transact-SQL statement:

SELECT * FROM sys.fulltext_languages

If you add, remove, or alter a word breaker, you need to refresh the list of Microsoft Windows locale identifiers (LCIDs) that are supported for full-text indexing and querying. For more information, see How to: Alter the List of Registered Word Breakers and Filters (Transact-SQL).
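
As a short sketch of that refresh step, the update_languages action of sp_fulltext_service can be run and the result checked against sys.fulltext_languages. This assumes that the word breaker itself has already been installed or removed at the operating-system level, which the referenced how-to topic covers.

```sql
-- Refresh the list of languages (LCIDs) registered for full-text search.
EXEC sp_fulltext_service @action = 'update_languages';

-- Confirm the result: list the languages that are now registered.
SELECT lcid, name
FROM sys.fulltext_languages
ORDER BY name;
```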

Several licensed third-party word breakers are shipped with SQL Server 2008. You can manually load additional third-party word breakers (and stemmers) for several languages (Danish, Polish, and Turkish). These word breakers are not enabled by default because they are owned by third parties who have not yet provided the level of testing, security, and robustness that is required for them to be enabled by default. For more information, see How to: Load Licensed Third-Party Word Breakers.

Full-Text Language Option

For a localized version of SQL Server, SQL Server Setup sets the default full-text language option to the language of the server if an appropriate match exists. For a non-localized version of SQL Server, the default full-text language option is English.

When creating or altering a full-text index, you can specify a different language for each full-text indexed column. If no language is specified for a column, the default is the value of the configuration option default full-text language.

For more information, see default full-text language Option.
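
The current value of the option can be inspected, and changed if needed, with sp_configure. The following is a sketch only: default full-text language is an advanced option, so show advanced options is enabled first, and the value 1033 (English) is just an example LCID.

```sql
-- Advanced options must be visible before this option can be read or set.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Display the current default full-text language (reported as an LCID).
EXEC sp_configure 'default full-text language';

-- Optionally change it; 1033 (English) is an illustrative value.
EXEC sp_configure 'default full-text language', 1033;
RECONFIGURE;
```

Changing the option affects only columns that are full-text indexed afterward without an explicit language.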

Note

All columns listed in a single full-text query function clause must use the same language, unless the LANGUAGE option is specified in the query. The language used for the full-text indexed column being queried determines the linguistic analysis performed on arguments of the full-text query predicates (CONTAINS and FREETEXT) and functions (CONTAINSTABLE and FREETEXTTABLE).
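
The following sketch illustrates the LANGUAGE option in a query that spans two columns indexed with different languages. The table dbo.Documents and its columns DocumentId, TitleEn, and TitleFr are hypothetical names used only for illustration.

```sql
-- Hypothetical table: TitleEn is full-text indexed as English (1033),
-- TitleFr as French (1036). Without the LANGUAGE option, listing both
-- columns in one CONTAINS predicate would not be allowed.
SELECT DocumentId, TitleEn, TitleFr
FROM dbo.Documents
WHERE CONTAINS((TitleEn, TitleFr), N'"run"', LANGUAGE 1033);
```

Here the search term is analyzed with the English word breaker and stemmer for both columns, regardless of the language each column was indexed with.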

Choosing a Language When Full-Text Indexing a Column

When creating a full-text index, we recommend that you specify a language for each indexed column. If a language is not specified for a column, the value of the default full-text language configuration option is used. The language of a column determines which word breaker and stemmer are used for indexing that column. The thesaurus file of that language is also used by full-text queries on the column.

Several considerations apply when you choose the column language for a full-text index. These considerations relate to how your text is tokenized and then indexed by the Full-Text Engine. For more information, see Best Practices for Choosing a Language When Creating a Full-Text Index.
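
A per-column language can be set with the LANGUAGE clause of CREATE FULLTEXT INDEX. In the sketch below, the table dbo.Documents, its columns, the unique key index PK_Documents, and the catalog DocumentsCatalog are all hypothetical and are assumed to exist already.

```sql
-- Assumes dbo.Documents has a unique, single-column, non-nullable
-- key index named PK_Documents and that the full-text catalog
-- DocumentsCatalog has been created. All names are illustrative.
CREATE FULLTEXT INDEX ON dbo.Documents
(
    TitleEn LANGUAGE 1033,   -- English word breaker and stemmer
    TitleFr LANGUAGE 1036    -- French word breaker and stemmer
)
KEY INDEX PK_Documents
ON DocumentsCatalog;
```

The language can be given either as an LCID, as shown here, or as a language name listed in sys.fulltext_languages.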

To view the word breaker language of a column

How to: View or Change the Properties of a Full-Text Index (SQL Server Management Studio)

sys.fulltext_index_columns (Transact-SQL)

SELECT language_id AS "LCID" FROM sys.fulltext_index_columns;
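
To see which table and column each LCID belongs to, along with the language name, the catalog view can be joined to sys.columns and sys.fulltext_languages. This is one possible formulation, run in the database that contains the full-text indexes.

```sql
-- List every full-text indexed column with its LCID and language name.
SELECT OBJECT_NAME(fic.object_id) AS table_name,
       c.name                     AS column_name,
       fic.language_id            AS lcid,
       fl.name                    AS language_name
FROM sys.fulltext_index_columns AS fic
JOIN sys.columns AS c
    ON c.object_id = fic.object_id
   AND c.column_id = fic.column_id
LEFT JOIN sys.fulltext_languages AS fl
    ON fl.lcid = fic.language_id
ORDER BY table_name, column_name;
```

The LEFT JOIN keeps columns whose language is not currently registered, in which case language_name is NULL.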

Impact of New Word Breakers in SQL Server 2008

SQL Server 2008 includes word breakers for more than 50 diverse languages, of which 23 also exist in SQL Server 2005. Only the word breakers for English, Korean, Thai, and Chinese (all forms) remain the same. For other languages, SQL Server 2008 introduces a new generation of word breakers that have better linguistic rules and are more accurate than earlier word breakers. The new word breakers might behave slightly differently from the word breakers that were used to build imported SQL Server 2005 full-text indexes. This is significant if a full-text catalog was imported when a SQL Server 2005 database was upgraded to SQL Server 2008, because one or more languages used by the full-text indexes in that catalog might now be associated with new word breakers. For more information, see Full-Text Search Upgrade.

Word Breaker Versions for Languages Supported in SQL Server 2005

The following table lists the word breakers that existed in SQL Server 2005 and indicates whether they have been updated in SQL Server 2008. For a complete list of all the SQL Server 2008 word breakers, see sys.fulltext_languages (Transact-SQL).

Note

The word breakers for most languages are registered by default. However, a number of licensed third-party word breakers are disabled by default. For information about these languages and how to register these word breakers, see How to: Load Licensed Third-Party Word Breakers.

Language	LCID	Word breaker in SQL Server 2008
Brazilian	1046	New
Chinese (Hong Kong SAR, PRC)	3076	Unchanged
Chinese (Macau SAR)	5124	Unchanged
Chinese (Singapore)	4100	Unchanged
Danish (disabled by default)	1030	Unchanged
Dutch	1043	New
English	1033	Unchanged
English (United Kingdom)	2057	Unchanged
French	1036	New
German	1031	New
Italian	1040	New
Japanese	1041	New
Korean	1042	Unchanged
Neutral	0	New
Polish (disabled by default)	1045	Unchanged
Portuguese	2070	New
Russian	1049	New
Simplified Chinese	2052	Unchanged
Spanish	3082	New
Swedish	1053	New
Thai	1054	Unchanged
Traditional Chinese	1028	Unchanged
Turkish (disabled by default)	1055	Unchanged

For a complete list of supported languages, see sys.fulltext_languages (Transact-SQL).
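
After an upgrade, it can be useful to find the full-text indexed columns whose language is marked New in the table above. The following sketch hard-codes those LCIDs as they appear in the table and should be run in the upgraded database.

```sql
-- Find full-text indexed columns that use a language whose word breaker
-- is listed as "New" in the table (LCIDs copied from that table).
SELECT OBJECT_NAME(fic.object_id) AS table_name,
       c.name                     AS column_name,
       fic.language_id            AS lcid
FROM sys.fulltext_index_columns AS fic
JOIN sys.columns AS c
    ON c.object_id = fic.object_id
   AND c.column_id = fic.column_id
WHERE fic.language_id IN (0, 1031, 1036, 1040, 1041, 1043,
                          1046, 1049, 1053, 2070, 3082)
ORDER BY table_name, column_name;
```

How to handle the indexes that this query returns is discussed in Full-Text Search Upgrade.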

Word-Breaking Time-out Errors

A word-breaking time-out error might occur in a variety of situations. For information about these situations and how to respond in each situation, see MSSQLSERVER_30053.

Obtaining Information About Word Breakers

To view the tokenization result of a word breaker, thesaurus, and stoplist combination

sys.dm_fts_parser (Transact-SQL)
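
As an illustrative sketch, the third argument of sys.dm_fts_parser selects the stoplist, so the same phrase can be parsed with and without one. The phrase and LCID 1033 (English) are arbitrary choices, and the rows returned depend on the registered word breaker and on the stoplist contents.

```sql
-- stoplist_id = NULL: no stoplist is applied.
SELECT occurrence, special_term, display_term
FROM sys.dm_fts_parser(N'"the quick brown fox"', 1033, NULL, 0);

-- stoplist_id = 0: the system stoplist is applied, so stopwords
-- are identified in the special_term column.
SELECT occurrence, special_term, display_term
FROM sys.dm_fts_parser(N'"the quick brown fox"', 1033, 0, 0);
```

The fourth argument controls accent sensitivity (0 for insensitive, 1 for sensitive).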

To return information about the registered word breakers

sp_help_fulltext_system_components (Transact-SQL)
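
For example, the procedure can list all registered word breakers or check a single language. The LCID 1055 (Turkish) below is only an illustration of a word breaker that is disabled by default.

```sql
-- List every registered word breaker.
EXEC sp_help_fulltext_system_components 'wordbreaker';

-- Check whether the word breaker for one language (here LCID 1055, Turkish)
-- is registered; no rows for the component means it is not registered.
EXEC sp_help_fulltext_system_components 'wordbreaker', 1055;
```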
