Edit a thesaurus file (Office SharePoint Server)
Updated: December 11, 2008
Applies To: Office SharePoint Server 2007
A thesaurus file is a query-expansion search feature in Microsoft Office SharePoint Server 2007 that enables users to type a phrase in a search box and receive results for words that are related to the phrase that was entered. For example, a search for the word "run" might return results that contain either the word "run" or the word "jog" if the two terms are related in the thesaurus file. Within a thesaurus file, you use replacement sets to specify patterns that are replaced with alternate values, and you use expansion sets to return additional values that are synonymous with the specified pattern.
Understanding thesaurus files
When Microsoft Office SharePoint Server 2007 is installed, a thesaurus file is automatically included for each of the languages supported by Office SharePoint Server 2007, along with a neutral thesaurus file, tsneu.xml. The neutral thesaurus file is applied to every query: it is the only thesaurus file used when no thesaurus file is associated with the query language, and it is applied in addition to the language-specific thesaurus file when one exists. For more information, see the "List of thesaurus files by language" section.
By default, thesaurus files are created and stored in the following location on the query server: Drive:\Program Files\Microsoft Office Servers\12.0\Data\Config. The thesaurus files from that default location are copied to the following folder location for each instance of the Microsoft Search service that exists on the query server: Drive:\Program Files\Microsoft Office Servers\12.0\Data\Office Server\Applications\<Application UID>\Config, where <Application UID> is the GUID associated with a particular shared services provider.
If you modify the thesaurus files in the default location, the modified versions are automatically copied every time a new Shared Services Provider (SSP) is created. If you modify the thesaurus files in the default location after an SSP has been created, you must copy the files from the default location to the Applications\<Application UID>\Config directory of each SSP that already exists.
A file named tsschema.xml is installed in the same directory as the thesaurus files. Do not modify the tsschema.xml file. It is referenced by all of the thesaurus files, and changing it could cause search to stop working properly.
By default, each thesaurus file contains inactive sample content. You must edit a thesaurus file before it can be used by search. Thesaurus files contain two primary types of entries: replacement sets and expansion sets. These entries are described in more detail in later sections of this topic. A third type of entry, diacritics_sensitive, specifies whether search ignores or respects diacritic marks such as accents. By default, diacritics are ignored, so the value is set to 0. To have search respect diacritic marks, change the value to 1.
The following is an example of the default XML in a thesaurus file:
<XML ID="Microsoft Search Thesaurus">
<!-- Commented out
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
-->
</XML>
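As a minimal sketch of the diacritics_sensitive setting, the following active (uncommented) thesaurus file sets the value to 1. The expansion set is taken from the default sample above; everything else in a real file stays as you have defined it.

```xml
<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <!-- 1 = diacritic marks such as accents are respected; the default, 0, ignores them -->
        <diacritics_sensitive>1</diacritics_sensitive>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
</XML>
```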
From a performance perspective, keep track of how many items are defined in the thesaurus file. A typical thesaurus file contains up to 1,000 items, and a file should not exceed the maximum of 10,000 items.
The entries that you add to the thesaurus file cannot consist only of special characters. However, entries can be blank. For example, to make sure that queries for a specific term return no results, add a replacement set that pairs the term with an empty substitution. In the following example, queries for the term “windows” will not return any results:
<replacement>
    <pat>windows</pat>
    <sub></sub>
</replacement>
Noise words can be included in a thesaurus file; however, they get filtered out at a later stage if you are also using a noise word file. For more information, see Edit a noise word file (Office SharePoint Server).
Using replacement sets
A replacement set specifies a pattern that is replaced by one or more substitutions in a search query. For example, you can add a replacement set where “W2K” is the pattern and “Windows 2000” is the substitution. For a query that contains the term “W2K”, Office SharePoint Server 2007 returns only search results that contain the term “Windows 2000”. Items that contain only the term “W2K” are not returned.
Each replacement set is enclosed in a <replacement> tag. Within the <replacement> tag, specify one or more patterns by enclosing each pattern in a <pat> tag, and specify one or more substitutions by enclosing each substitution in a <sub> tag. Patterns and substitutions can contain a word or a sequence of words. For example, to add a replacement set where “W2K” is the pattern and “Windows 2000” is the substitution, use the following:
<replacement>
    <pat>W2K</pat>
    <sub>Windows 2000</sub>
</replacement>
You can have more than one substitution for each pattern that you specify.
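To illustrate, the following sketch pairs two patterns with two substitutions. The patterns “NT5” and “W2K” and the substitution “Windows 2000” come from the default sample file; the second substitution, “Win2K”, is a hypothetical value added only for illustration.

```xml
<replacement>
    <!-- Either pattern triggers the replacement -->
    <pat>NT5</pat>
    <pat>W2K</pat>
    <!-- Each substitution is a term that replaces the matched pattern in the query -->
    <sub>Windows 2000</sub>
    <!-- Hypothetical second substitution, not part of the default file -->
    <sub>Win2K</sub>
</replacement>
```

With a set like this in place, a query for either pattern would be expected to return items that contain one of the substitutions rather than items that contain only the pattern.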
Ideally, use replacement sets only for terms that everyone agrees are equivalent. For example, a deprecated term, such as an internal product name, can be replaced in a query by the current term, such as the released product name.
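A sketch of that scenario, assuming a hypothetical internal code name “Project Falcon” and a hypothetical released name “Contoso Reports” (neither name comes from the product documentation):

```xml
<replacement>
    <!-- Hypothetical internal code name that users might still type -->
    <pat>Project Falcon</pat>
    <!-- Hypothetical released product name that published content uses -->
    <sub>Contoso Reports</sub>
</replacement>
```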
Using expansion sets
An expansion set is a group of substitutions that are synonyms of each other. Queries that contain matches in one substitution are expanded to include all other substitutions in the expansion set. For example, you can add an expansion set where the following substitutions are synonyms: “writer”, “author”, and “journalist”.
If you query the term “author”, Office SharePoint Server 2007 also returns search results that contain the term “writer” and the term “journalist”.
Each expansion set is enclosed in an <expansion> tag. Within the <expansion> tag, specify one or more substitutions by enclosing each substitution in a <sub> tag. For example, to add the expansion set from the previous example, use the following:
<expansion>
    <sub>writer</sub>
    <sub>author</sub>
    <sub>journalist</sub>
</expansion>
You can include individual words or phrases in a thesaurus file. The word breaker for a given language identifies individual words by determining where word boundaries exist based on the lexical rules of the language. If you include a word in a thesaurus file that the word breaker does not recognize as a single word, you should also include the word in a custom dictionary so that the word breaker does not break up the word into smaller tokens. For example, if you use the word “IT&T” in an expansion set, but do not include it in a custom dictionary, the word breaker might break the word down into two separate words, “IT” and “T”. This can cause the expansion set to not work as expected when a search query is performed. For information about creating and using custom dictionaries, see Create a custom dictionary (Office SharePoint Server 2007).
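As a sketch of the “IT&T” case, the expansion set below assumes that the term has also been added to a custom dictionary. Because a thesaurus file is XML, the ampersand in element content is written as the entity &amp;amp; (a literal ampersand would make the file malformed). The second substitution, “International Tools and Technology”, is a hypothetical synonym used only for illustration.

```xml
<expansion>
    <!-- "IT&T" with the ampersand escaped so that the file stays well-formed XML -->
    <sub>IT&amp;T</sub>
    <!-- Hypothetical synonym, not taken from the product documentation -->
    <sub>International Tools and Technology</sub>
</expansion>
```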
Editing a thesaurus file
Use the following procedure to edit a thesaurus file.
When editing a file, you must use matching pairs of opening and closing tags around each entry in the file. If the XML tags in the thesaurus file are not matched, an error will be logged in the application event log.
Edit a thesaurus file
Start Notepad, and then open a thesaurus file. For information on locating and identifying the appropriate thesaurus file, see the "Understanding thesaurus files" section.
If you are changing the thesaurus file for the first time, remove the <!-- Commented out comment line that appears at the beginning of the file, and the --> comment line that appears at the end of the file.
Make any changes to the thesaurus file. Add, modify, or delete a replacement set or an expansion set.
Save the thesaurus file, and then close Notepad.
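The result of the procedure can be sketched as follows, assuming that you start from the default sample content, remove the two comment lines, and add the “writer”, “author”, and “journalist” expansion set described in this topic. Every opening tag has a matching closing tag, which avoids the error that is otherwise logged in the application event log.

```xml
<XML ID="Microsoft Search Thesaurus">
    <!-- The "Commented out" line and its closing marker have been removed, so the content is active -->
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
        <!-- Expansion set added during editing -->
        <expansion>
            <sub>writer</sub>
            <sub>author</sub>
            <sub>journalist</sub>
        </expansion>
    </thesaurus>
</XML>
```

If the file is edited in the default location after an SSP already exists, remember that it still has to be copied to the Config directory of each existing SSP.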
List of thesaurus files by language
English (United Kingdom)
English (United States)