SQL Server 2005 Books Online (November 2008)
Configuring Thesaurus Files

Updated: 12 December 2006

All thesaurus files that are included with Microsoft SQL Server 2005 are formatted as follows.

<XML ID="Microsoft Search Thesaurus">

<!--  Commented out
    <thesaurus xmlns="x-schema:tsSchema.xml">
      <diacritics = false/>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
-->
</XML>

Each thesaurus file has one or more of the following sections:

  • Expansion set
    An expansion set contains a group of synonyms. These synonyms are identified in code by "substitution" tags (<sub> and </sub>). Queries that contain matches in one substitution are expanded to include all other substitutions in the expansion set.
  • Replacement set
    A replacement set contains a text pattern to be replaced by a substitution set. For an example, see the section "Replacement Set" later in this topic.

Additionally, the thesaurus file includes a <diacritics = false/> tag. The value false indicates that the terms specified in the expansion and replacement sets are accent-insensitive. To make searches that use the thesaurus accent-sensitive, change this tag to <diacritics = true/>. For example, suppose you specify the pattern "café" to be replaced by other terms in a Full-Text Search query. If the thesaurus file is accent-insensitive, Full-Text Search replaces both "café" and "cafe". If the thesaurus file is accent-sensitive, Full-Text Search replaces only "café". This setting can be specified only once in the file and applies to all search patterns in the file; it cannot be specified for individual patterns.
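The difference between the two settings can be modeled with a short Python sketch. This is only an illustration of accent-insensitive comparison, not the algorithm that Full-Text Search uses internally; the helper names strip_accents and pattern_matches are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks so that "café" and "cafe" compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pattern_matches(pattern: str, term: str, accent_sensitive: bool) -> bool:
    """Model (assumption) of comparing a thesaurus pattern with a query term."""
    if accent_sensitive:
        # Normalize both sides so that equivalent encodings of "é" still match.
        return (unicodedata.normalize("NFC", pattern)
                == unicodedata.normalize("NFC", term))
    return strip_accents(pattern) == strip_accents(term)
```

With accent_sensitive=False the pattern "café" matches both "café" and "cafe"; with accent_sensitive=True it matches only "café".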

Important:
When you edit thesaurus files in a text editor, you must save the files in Unicode format and include a byte order mark (BOM).
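If you generate or modify a thesaurus file from a script instead of a text editor, the same requirement applies. The following Python sketch assumes that "Unicode format" means UTF-16 little-endian, which is what Windows text editors typically label "Unicode"; the function names and the file name used with them are hypothetical.

```python
import codecs
from pathlib import Path


def save_thesaurus(path: str, xml_text: str) -> None:
    """Write xml_text as UTF-16 LE, preceded by a byte order mark.

    Assumption: "Unicode format" refers to UTF-16 little-endian.
    """
    data = codecs.BOM_UTF16_LE + xml_text.encode("utf-16-le")
    Path(path).write_bytes(data)


def has_utf16_bom(path: str) -> bool:
    """Return True if the file starts with a UTF-16 byte order mark."""
    with open(path, "rb") as handle:
        head = handle.read(2)
    return head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Calling has_utf16_bom on a file before you deploy it is a quick way to catch a file that was accidentally saved as ANSI or as UTF-8.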

Expansion Set

Each expansion set is enclosed within an <expansion> tag. Within the expansion tag, you specify one or more substitutions, each enclosed by a <sub> tag. The substitutions in an expansion set are treated as synonyms of each other.

For example, you can edit the expansion section to treat the substitutions "writer", "author", and "journalist" as synonyms. Full-Text Search queries that match one substitution are expanded to include all the other substitutions in the expansion set. Therefore, when you issue a FORMSOF THESAURUS query or a FREETEXT query for the word "author", Full-Text Search also returns search results that contain the words "writer" and "journalist".

The expansion set for this example looks like this:

 <expansion>
         <sub>writer</sub>
         <sub>author</sub>
         <sub>journalist</sub>
 </expansion>
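The expansion behavior described above can be modeled in a few lines of Python. This is a simplified sketch that uses exact, case-sensitive string comparison, not the actual Full-Text Search implementation; expand_term and EXPANSION_SETS are hypothetical names.

```python
# Each inner list models one <expansion> element and its <sub> entries.
EXPANSION_SETS = [["writer", "author", "journalist"]]


def expand_term(term: str, expansion_sets: list[list[str]]) -> list[str]:
    """Return the term plus every synonym from expansion sets that contain it."""
    results = [term]
    for synonyms in expansion_sets:
        if term in synonyms:
            for synonym in synonyms:
                if synonym not in results:
                    results.append(synonym)
    return results
```

A term that appears in no expansion set is returned unchanged, which corresponds to a query that the thesaurus does not affect.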

Replacement Set

Each replacement set is enclosed within a <replacement> tag. Within each replacement tag, you specify one or more patterns, each enclosed by a <pat> tag, and one or more substitutions, each enclosed by a <sub> tag. A query term that matches a pattern is replaced by the substitutions. Patterns and substitutions can each contain a single word or a sequence of words.

For example, suppose you want queries for the pattern "W2K" to be replaced by the substitutions "Windows 2000" or "XP". If you run a full-text query for "W2K", Full-Text Search returns only search results that contain "Windows 2000" or "XP". It does not return results that contain "W2K", because the pattern "W2K" has been replaced by the substitutions "Windows 2000" and "XP".

The replacement set for this example looks like this:

 <replacement>
         <pat>W2K</pat>
         <sub>Windows 2000</sub>
         <sub>XP</sub>
 </replacement>
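The key difference from an expansion set is that the original term is dropped from the query. The following Python sketch models this under the same simplifying assumptions as before (exact, case-sensitive comparison); replace_term and REPLACEMENT_SETS are hypothetical names.

```python
# Each tuple models one <replacement> element: (<pat> entries, <sub> entries).
REPLACEMENT_SETS = [(["W2K"], ["Windows 2000", "XP"])]


def replace_term(term: str, replacement_sets) -> list[str]:
    """Return the substitutions for a matching pattern, or the term itself."""
    for patterns, substitutions in replacement_sets:
        if term in patterns:
            # The pattern is not kept: only the substitutions are searched.
            return list(substitutions)
    return [term]
```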

If two replacement sets contain patterns that match the same query text, the longer pattern takes precedence. For example, if you run a FORMSOF THESAURUS query for "Internet Explorer online community" and you have the following replacement sets, the "Internet Explorer" replacement set takes precedence over the "Internet" replacement set. The query is therefore processed as "IE online community" or "IE 5 online community".

<replacement>
         <pat>Internet</pat>
         <sub>intranet</sub>
</replacement>

and

<replacement>
         <pat>Internet Explorer</pat>
         <sub>IE</sub>
         <sub>IE 5</sub>
</replacement>
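The precedence rule can be modeled as a longest-match-first scan over the words of the query. The following Python sketch is an assumption about how such a rule could be implemented, not a description of the Full-Text Search engine; rewrite_query and its word-by-word, case-sensitive matching are hypothetical simplifications.

```python
from itertools import product


def rewrite_query(query: str, replacement_sets) -> list[str]:
    """Rewrite a query, letting longer patterns win over shorter ones."""
    words = query.split()
    # Map each non-empty pattern (as a tuple of words) to its substitutions.
    rules = {}
    for patterns, substitutions in replacement_sets:
        for pattern in patterns:
            if pattern.split():
                rules[tuple(pattern.split())] = list(substitutions)
    # Try patterns with more words first so that they take precedence.
    ordered = sorted(rules, key=len, reverse=True)
    slots = []  # alternatives for each consumed stretch of the query
    i = 0
    while i < len(words):
        for pattern in ordered:
            if tuple(words[i:i + len(pattern)]) == pattern:
                slots.append(rules[pattern])
                i += len(pattern)
                break
        else:
            slots.append([words[i]])
            i += 1
    # Every combination of alternatives is one way to process the query.
    return [" ".join(choice) for choice in product(*slots)]


SETS = [
    (["Internet"], ["intranet"]),
    (["Internet Explorer"], ["IE", "IE 5"]),
]
```

With these sets, rewrite_query("Internet Explorer online community", SETS) yields the two forms named in the text, and the shorter "Internet" pattern applies only when "Explorer" does not follow it.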

Release History

12 December 2006

Changed content:
  • Corrected the syntax of the <diacritics_sensitive> tag to <diacritics = false/> and updated the explanation of this tag.
New content:
  • Added the Important note that states that thesaurus files must be saved in Unicode format and that byte order marks must be specified.

17 July 2006

New content:
  • Clarified the meaning of the <diacritics_sensitive> tag.

Community Content

Does not work for hyphenated terms (Ed Graham)
E.g., the following expansion will cause the entire thesaurus file to fail (writing an error to the Application log):

<expansion>
<sub>AERIAL SPORTS</sub>
<sub>AIR SPORTS</sub>
<sub>FLYING (SPORT)</sub>
<sub>GLIDING</sub>
<sub>HANG-GLIDING</sub>
<sub>PARACHUTING</sub>
</expansion>

Can this be made explicit in the documentation? Also, the parenthesis term FLYING (SPORT) seems to work, so can we clarify exactly what forms are permissible?

Ed Graham

Diacritics XML tag is neither valid nor well-formed (bcraun ... Thomas Lee)
  <diacritics = false/>

Is this benign, and is it something the SQL Server Full-Text Indexing engine is expecting?

Diacritics tag makes no sense (poindexter)
The <diacritics = false/> tag makes no sense at all. This is not valid XML.
© 2012 Microsoft. All rights reserved.