A word breaker tokenizes the text being indexed on word boundaries, which are language specific. Therefore, word-breaking behavior differs among different languages. If you use one language, x, to index a number of languages {x, y, and ,z}, some of the behavior might cause unexpected results. For example, a dash (-) or a comma (,) might be a word-break element that will be thrown away in one language but not in another. Also rarely unexpected stemming behavior might occur because a given word might stem differently in different language. For example, in the English language, word boundaries are typically white space or some form of punctuation. In other languages, such as German, words or characters may be combined together. Therefore, the column-level language that you choose should represent the language that you expect will be stored in rows of that column.
Western Languages
For the Western family of languages, if you are unsure which languages will be stored in a column or you expect more than one to be stored, a general workaround is to use the word breaker for the most complex language that might be stored in the column. For instance, you might expect to store English, Spanish and German content in a single column. These three Western languages possess very similar word-breaking patterns, with the German patterns being the most complex. Therefore, a good choice is this case would be to use the German word breaker, which should be able to process English and Spanish text correctly. In contrast, the English word breaker might not process German text perfectly because of the compound words of German.
Note that using the word breaker of the most complex language in a language family does not guarantee perfect indexing of every language in the family. Corner cases might exist in which the most complex word breaker cannot correctly handle text written in another language.
Non Western Languages
For non Western languages (such as Chinese, Japanese, Hindi, and so forth) the above workaround does not necessarily work, for linguistic reasons. For non Western languages, consider one of the following workarounds:
-
For languages from different families
If a column might contain dramatically different languages, for example, Spanish and Japanese, consider storing the content of different languages in separate columns. This would allow you to use the language-specific word breaker for each column. If you choose this solution and you don't know the query language at query time, you might need to issue the query against both columns to ensure that the query finds the right row or document.
-
For Binary content (such as Microsoft Word documents)
When the indexed content is of binary type, the full-text search filter that processes the textual content before sending it to the word breaker might honor specific language tags existing within the binary file. In this case, at indexing time, the filter will emit the right LCID for a document or section of a document. The Full-Text Engine will then call the word breaker for the language with that LCID. However, after indexing multi language content, we recommend that you verify that the content was correctly indexed.
-
For plain text content
When your content is plain text, you can convert it to the xml data type and add language tags that indicate the language corresponding to each specific document or document section. For this to work, however, you need to know the language before full-text indexing.