
Create a custom dictionary (Search Server 2008)

Updated: October 9, 2008

Applies To: Microsoft Search Server 2008

 

Topic Last Modified: 2008-10-02

A custom dictionary is a Unicode-encoded file that specifies words that you want the word breaker of the same language to treat as complete words. Custom dictionaries are not provided by default. To modify word breaker behavior for more than one language, you must create a separate custom dictionary for each language. You cannot create a custom dictionary for the language-neutral word breaker.

The following table lists the languages and dialects for which Microsoft Search Server 2008 supports custom dictionaries. This table also includes the language code identifier (LCID) and language hex code for each language and dialect.

Note that the first two digits of each language hex code represent the dialect and the last two digits represent the language. For languages that do not have separate word breakers for separate dialects, the first two digits of the language hex code are always zeros.
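The relationship between an LCID and its hex code can be checked with a few lines of Python. This is only a sketch of the arithmetic described above; the function names `split_lcid` and `language_only_code` are made up for illustration and are not part of Search Server 2008.

```python
def split_lcid(lcid):
    """Return (dialect, language) as two-digit hex strings for an LCID."""
    hex_code = format(lcid, "04x")  # for example, 2070 becomes "0816"
    return hex_code[:2], hex_code[2:]


def language_only_code(lcid):
    """Return the hex code with the dialect digits set to zeros.

    This follows the article's description for languages that do not have
    separate word breakers for separate dialects.
    """
    return format(lcid & 0xFF, "04x")


print(split_lcid(2070))          # Portuguese: ('08', '16')
print(split_lcid(1046))          # Portuguese_Braz: ('04', '16')
print(language_only_code(1033))  # English: 0009
```

The two Portuguese entries share the language digits 16 and differ only in the dialect digits, which is why each has its own language hex code in Table 1.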

Table 1 - Supported languages

 

Language / dialect    LCID    Language hex code

Arabic                1025    0001
Bengali               1093    0045
Bulgarian             1026    0002
Catalan               1027    0003
Croatian              1050    001a
Danish                1030    0006
Dutch                 1043    0013
English               1033    0009
French                1036    000c
German                1031    0007
Gujarati              1095    0047
Hebrew                1037    000d
Hindi                 1081    0039
Icelandic             1039    000f
Indonesian            1057    0021
Italian               1040    0010
Japanese              1041    0011
Kannada               1099    004b
Latvian               1062    0026
Lithuanian            1063    0027
Malay                 1086    003e
Malayalam             1100    004c
Marathi               1102    004e
Norwegian_Bokmaal     1044    0414
Portuguese            2070    0816
Portuguese_Braz       1046    0416
Punjabi               1094    0046
Romanian              1048    0018
Russian               1049    0019
Serbian_Cyrillic      3098    0c1a
Serbian_Latin         2074    081a
Slovak                1051    001b
Slovenian             1060    0024
Spanish               3082    000a
Swedish               1053    001d
Tamil                 1097    0049
Telugu                1098    004a
Ukrainian             1058    0022
Urdu                  1056    0020
Vietnamese            1066    002a

Custom dictionaries are used to make the word breaker of a particular language leave a particular word intact instead of breaking it. To understand whether you need a custom dictionary and what entries it should contain, it is helpful to understand the behavior of word breakers.

Word breakers are used by the indexing system to break words into tokens when content is indexed. Word breakers are also used by the query system to break words in a query into tokens. In both cases, if a custom dictionary exists for the language and dialect of the word breaker that is being used, the Office Server Search Service checks whether the word exists in the custom dictionary before deciding whether to use the word breaker for that word. If the word does not exist in the custom dictionary, the word breaker performs its usual actions, which might result in breaking the word into multiple words or tokens. If the word does exist in the custom dictionary, the word breaker does not perform any actions on that word.

The following examples describe typical word breaker behavior and how an entry in the custom dictionary can affect that behavior.

A word breaker that encounters a word such as IT&T might break the word at the ampersand (&). The result is the word IT and the letter T, which the word breakers for most languages would discard as noise words. However, if IT&T exists in the custom dictionary of the same language as the word breaker that is being used, the word breaker leaves IT&T intact. This means that, if a full crawl were being performed, the word would be indexed as IT&T. When a user types a query for IT&T, the word breaker would not break the word at query time either. As a result, queries containing “IT” or “T” would not return documents that contain “IT&T” but do not contain those words on their own.

Terms such as SystemicChemicalNames (SCN) or CAS numbers can also be affected by word breakers. For example, word breakers typically split numbers that appear before or after a hyphen or other special character from the rest of the term. An example of a CAS number is 7782-44-7, which is the CAS registry number for oxygen. After word breaker processing, this number is broken into three separate parts: 7782, 44, and 7. Adding the SCNs and CAS numbers that appear in a corpus to the custom dictionary of each language to which they apply enables the system to index them without breaking them into separate numbers. Because the appropriate word breaker and custom dictionary for the language of the content are used at query time, a user can also include an SCN or CAS number in a query without it being broken into separate parts.
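The effect described in these examples can be imitated with a toy tokenizer. The sketch below is not the Search Server 2008 word breaker: `toy_word_break` is a made-up function that simply splits on non-alphanumeric characters, and it does not model noise-word removal. It only illustrates the order of operations, in which the custom dictionary is consulted first and the breaking rules are applied only to words that are not found there.

```python
import re


def toy_word_break(text, custom_dictionary):
    """Split text into tokens, leaving custom dictionary entries intact.

    Toy stand-in for a word breaker: words are separated on white space,
    then split at every non-alphanumeric character unless the whole word
    is listed in the custom dictionary.
    """
    # Entries are not case sensitive, so compare in lowercase.
    known = {entry.lower() for entry in custom_dictionary}
    tokens = []
    for word in text.split():
        if word.lower() in known:
            tokens.append(word)  # found in the dictionary: do not break
        else:
            tokens.extend(t for t in re.split(r"[^0-9A-Za-z]+", word) if t)
    return tokens


sample = "IT&T ships 7782-44-7"
print(toy_word_break(sample, []))
# ['IT', 'T', 'ships', '7782', '44', '7']
print(toy_word_break(sample, ["it&t", "7782-44-7"]))
# ['IT&T', 'ships', '7782-44-7']
```

Because the same lookup happens at index time and at query time, a query for 7782-44-7 is matched against the unbroken token rather than against three separate numbers.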

Named entity normalizations, such as date normalization, that are normally applied by word breakers are not applied to query terms that appear in custom dictionaries. Instead, all query terms that appear in custom dictionaries are treated as exact matches. This is especially important if you have words or numbers (such as the ones mentioned earlier) in a thesaurus file. For example, if the CAS number 7782-44-7 is part of an expansion set in the thesaurus and the word breaker splits that number at the hyphens into three separate numbers, the expansion set might not work as expected. In this case, adding 7782-44-7 to the custom dictionary of the appropriate language resolves the problem.

Creating or modifying a custom dictionary is simple. A custom dictionary is a Unicode-formatted file that contains one entry (a word that you specify) per line, with each line ended by a carriage return (CR) and line feed (LF). When adding entries to a custom dictionary, keep the following rules in mind to avoid unexpected results:

  • Entries are not case sensitive.

  • The pipe (|) character cannot be used anywhere in a custom dictionary.

  • White space cannot be used anywhere in a custom dictionary.

  • The pound sign (#) character cannot be used at the beginning of an entry but it can be used within or at the end of an entry.

  • Except for the pipe, pound sign, and white space characters mentioned earlier, any alphanumeric characters, punctuation, symbols, and breaking characters are valid.

  • The maximum length of an entry is 128 (Unicode) characters.

The following table shows examples of supported and unsupported entries.

Table 2 – Examples of supported and unsupported entries

 

Supported                          Not supported

dogfood                            dog food
3#                                 #3
Four#sale                          dog|food
ASP.NET
IT&T
(2-Methoxymethylethoxy)propanol
34590-97-8
C7H1603
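The entry rules listed earlier are mechanical enough to check before a file is saved. The following is a minimal sketch of such a check; the function name `entry_problems` is hypothetical, and it assumes that the 128-character limit can be approximated by Python's `len()`, which counts Unicode code points.

```python
MAX_ENTRY_LENGTH = 128  # maximum entry length given in the rules above


def entry_problems(entry):
    """Return a list of rule violations for one custom dictionary entry."""
    problems = []
    if "|" in entry:
        problems.append("contains the pipe (|) character")
    if any(ch.isspace() for ch in entry):
        problems.append("contains white space")
    if entry.startswith("#"):
        problems.append("starts with the pound sign (#)")
    if len(entry) > MAX_ENTRY_LENGTH:
        problems.append("is longer than 128 characters")
    return problems


# Entries taken from Table 2.
for candidate in ["dogfood", "dog food", "3#", "#3", "Four#sale", "dog|food"]:
    issues = entry_problems(candidate)
    print(candidate, "->", "supported" if not issues else "; ".join(issues))
```

An empty list means that the entry satisfies every rule that the sketch checks; it does not guarantee that a particular word breaker will treat the entry as expected.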

There is no fixed limit to the number of entries in a custom dictionary, but we recommend that the total file size of a custom dictionary not exceed 2 GB. In practice, we suggest that you limit the number of entries to a few thousand.

Before you create a custom dictionary, make sure you have read the Before you begin section earlier in this article because it is important to understand the difference between supported and unsupported entries in a custom dictionary.

Note:
To perform this procedure, you must be a member of the Administrators group on the index server and each query server in the server farm.
To create a custom dictionary
  1. Log on to the index server as a member of the Administrators group.

  2. Start Notepad and type the words you want in your custom dictionary. Be sure to avoid invalid entries as described in the Before you begin section.

    Tip:
    Remember that each word must be on a separate line and separated by a carriage return (CR) and line feed (LF).
  3. On the File menu, click Save As.

  4. In the Save as type list, select All Files.

  5. In the Encoding list, select Unicode.

  6. In the File name box, type the file name in the following format: CustomNNNN.lex, where NNNN is the language hex code of the language for which you are creating the custom dictionary. For example, the file name for an English custom dictionary is Custom0009.lex. See Table 1, earlier in this article, for the language hex codes of supported languages and dialects.

  7. In the Save in list, navigate to the folder that contains the word breakers. By default, this is drive:\program files\Microsoft Office Servers\12\bin, where drive is the drive letter on which Search Server 2008 is installed.

  8. Click Save.
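If you prefer to generate the file with a script instead of Notepad, the steps above can be approximated as follows. This is a sketch under stated assumptions: Notepad's Unicode encoding is UTF-16 little-endian with a byte order mark, and the sketch assumes that Search Server 2008 expects the same bytes. The function name `write_custom_dictionary` is hypothetical, and the example writes to a temporary folder rather than to the word breaker folder.

```python
import codecs
import tempfile
from pathlib import Path


def write_custom_dictionary(folder, language_hex_code, entries):
    """Write entries to CustomNNNN.lex as UTF-16 LE with a byte order mark.

    Each entry is placed on its own line, ended by CR and LF.
    """
    path = Path(folder) / f"Custom{language_hex_code}.lex"
    body = "".join(entry + "\r\n" for entry in entries)
    # Assumption: this matches what Notepad saves as "Unicode".
    path.write_bytes(codecs.BOM_UTF16_LE + body.encode("utf-16-le"))
    return path


# Example: an English (0009) dictionary written to a temporary folder.
target = tempfile.mkdtemp()
created = write_custom_dictionary(target, "0009", ["IT&T", "7782-44-7"])
print(created.name)  # Custom0009.lex
```

To use the result, the file would still have to be placed in the folder that contains the word breakers, as described in step 7.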

    Perform the following procedure only if you have query servers that are separate from the index server. Otherwise, go to Stop and restart the Office SharePoint Server Search Service.

Copy the custom dictionary to other servers
  1. Log on to the index server as a member of the Administrators group.

  2. Navigate to the folder in which you saved the custom dictionary file.

  3. Copy the custom dictionary file to the folder that contains the word breakers on your first query server. By default, this is drive:\program files\Microsoft Office Servers\12\bin, where drive is the drive letter on which Search Server 2008 is installed.

  4. Perform a full crawl of the affected content. For information about performing a full crawl, see Crawl content (Search Server 2008).

  5. Repeat steps 1 through 3 on each query server in the server farm.
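In a farm with several query servers, the copy step can be scripted. The sketch below uses only the Python standard library; the function name `copy_dictionary` is hypothetical, and the target folders are whatever paths reach the word breaker folder on each query server in your environment (the share path in the comment is purely illustrative).

```python
import shutil
from pathlib import Path


def copy_dictionary(dictionary_path, target_folders):
    """Copy one custom dictionary file into each target folder.

    Returns the list of destination paths that were written.
    """
    source = Path(dictionary_path)
    copied = []
    for folder in target_folders:
        destination = Path(folder) / source.name
        shutil.copy2(source, destination)  # keeps the file name and timestamps
        copied.append(destination)
    return copied


# Hypothetical usage; the server name and share are placeholders:
# copy_dictionary(
#     r"C:\Program Files\Microsoft Office Servers\12\bin\Custom0009.lex",
#     [r"\\QUERY01\c$\Program Files\Microsoft Office Servers\12\bin"],
# )
```

The script only copies the file. The search service on each server must still be restarted as described in the next procedure.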

You must restart the OSearch Service on the index server and all query servers.

Important:
Do not use the Services on Server page in Central Administration to stop and start these services because doing so removes the service and deletes the index and associated configuration. Instead, use the following steps.
To stop and restart the Office SharePoint Server Search service
  1. Log on to the index server as a member of the Administrators group.

  2. On the Start menu, point to All Programs, point to Administrative Tools, and then click Services.

  3. Scroll down the list, right-click the Office SharePoint Server Search service, and then click Properties. The properties page appears.

  4. Click Stop. After the service is stopped, click Start.

  5. Ensure that the Startup Type is not set to Disabled.

  6. If your server farm has query servers that are separate from the index server, repeat steps 1 through 5 on each query server.
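The stop and start in step 4 can also be issued from an elevated command prompt with the standard Windows `net stop` and `net start` commands. The following sketch wraps them in Python. It assumes that the service name these commands accept is OSearch, the short name this article uses for the service; verify the name on your servers before running it. The function names are hypothetical, and the sketch defaults to a dry run that only prints the commands. It does not check the Startup Type from step 5.

```python
import subprocess

# Assumption: "OSearch" is the service name accepted by net stop/net start.
SERVICE_NAME = "OSearch"


def restart_commands(service_name=SERVICE_NAME):
    """Return the two commands that stop and then start the service."""
    return [["net", "stop", service_name], ["net", "start", service_name]]


def restart_service(service_name=SERVICE_NAME, dry_run=True):
    """Print the restart commands, or run them when dry_run is False."""
    for command in restart_commands(service_name):
        if dry_run:
            print(" ".join(command))
        else:
            # Requires Windows and an elevated (administrator) session.
            subprocess.run(command, check=True)


restart_service()  # dry run: prints "net stop OSearch" and "net start OSearch"
```

Run it with `dry_run=False` on the index server and on each query server only after confirming the service name, because a failed command raises `subprocess.CalledProcessError`.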

To apply the custom dictionary to the content index, you must perform a full crawl of all content sources that contain the words that you added to the custom dictionary. For information about performing a full crawl, see Crawl content (Search Server 2008).

© 2014 Microsoft