Thesaurus Configuration

In SQL Server, full-text queries can search for synonyms of user-specified terms by using a thesaurus. A SQL Server thesaurus defines a set of synonyms for a specific language. System administrators can define two forms of synonyms: expansion sets and replacement sets. By developing a thesaurus tailored to your full-text data, you can effectively broaden the scope of full-text queries on that data. Thesaurus matching occurs only for CONTAINS and CONTAINSTABLE queries that specify the FORMSOF THESAURUS clause and for FREETEXT and FREETEXTTABLE queries.

Before full-text search queries on your server instance can look for synonyms in a given language, you must define thesaurus mappings (synonyms) for that language. Each thesaurus must be manually configured to define the following:

  • Diacritics setting

    For a given thesaurus, all search patterns are either sensitive or insensitive to diacritical marks such as a tilde (~), acute accent mark (´), or umlaut (¨) (that is, accent sensitive or accent insensitive). For example, suppose you specify the pattern "café" to be replaced by other patterns in a full-text query. If the thesaurus is accent-insensitive, full-text search replaces the patterns "café" and "cafe". If the thesaurus is accent-sensitive, full-text search replaces only the pattern "café". By default, a thesaurus is accent-insensitive.

    Note

    For information about diacritical marks, see Diacritical Mark in the MSN Encarta Encyclopedia.

  • Expansion set

    An expansion set contains a group of synonyms such as "writer", "author", and "journalist" that are substituted for one another by a full-text query. Queries that contain a match for any synonym in an expansion set are expanded to include every other synonym in the expansion set.

    For more information, see "XML Structure of an Expansion Set," later in this topic.

  • Replacement set

    A replacement set contains a text pattern to be replaced by a substitution set. For an example, see the section "XML Structure of a Replacement Set" later in this topic.
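
To make the diacritics setting described above concrete, the "café" example could be written roughly as follows. This is a sketch only: the element names are the ones described later in this topic, the substitution "coffeehouse" is a hypothetical value chosen for illustration, and the XML comments are annotations for the reader.

```xml
<thesaurus xmlns="x-schema:tsSchema.xml">
    <!-- 1 = accent sensitive: only "café" matches the pattern below.
         With 0 (the default), both "café" and "cafe" would match. -->
    <diacritics_sensitive>1</diacritics_sensitive>
    <replacement>
        <pat>café</pat>
        <!-- Hypothetical substitution, for illustration only -->
        <sub>coffeehouse</sub>
    </replacement>
</thesaurus>
```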

Note

For restrictions on and recommendations for a thesaurus file, see How to: Edit a Thesaurus File (Full-Text Search).

SQL Server provides a set of XML thesaurus files, one for each supported language. These files are essentially empty. They contain only the top-level XML structure that is common to all SQL Server thesauruses and a commented-out sample thesaurus.

This topic contains the following information to help you configure a thesaurus:

  • Initial Content of the Thesaurus Files

  • Location of the Thesaurus Files

  • How Queries Use Thesaurus Files

  • Understanding the Structure of a Thesaurus File

  • Working with Thesaurus Files

Initial Content of the Thesaurus Files

The thesaurus files that are released with SQL Server 2008 all contain the following XML code:

<XML ID="Microsoft Search Thesaurus">

<!--  Commented out

    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
-->
</XML>
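
Because the sample thesaurus is commented out, the shipped file defines no synonyms until it is edited. As a minimal sketch, assuming the comment delimiters are removed and the sample entries are replaced with the writer/author/journalist and W2K examples used later in this topic, an edited file might look like this:

```xml
<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>writer</sub>
            <sub>author</sub>
            <sub>journalist</sub>
        </expansion>
        <replacement>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
            <sub>XP</sub>
        </replacement>
    </thesaurus>
</XML>
```

The entries are illustrative. An edited file must be loaded before queries can use it, as noted under "Working with Thesaurus Files."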

Location of the Thesaurus Files

The default location of the thesaurus files is:

<SQL_Server_data_files_path>\MSSQL10_50.MSSQLSERVER\MSSQL\FTDATA\

This default location contains the following files:

  • Language-specific thesaurus files

    During setup, empty thesaurus files are installed in the above location. A separate file is provided for each supported language. A system administrator can customize these files.

    The default file names of the thesaurus files use the following format:

    'ts' + <three-letter language-abbreviation> + '.xml'

    The name of the thesaurus file for a given language is specified in the following registry value: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<instance-name>\MSSearch\<language-abbrev>.

  • The global thesaurus file

    An empty global thesaurus file, tsGlobal.xml.

You can change the location and name of a thesaurus file by changing its registry value. For each language, the location of the thesaurus file is specified in the following registry value:

HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\<instance name>\MSSearch\Language\<language-abbreviation>\TsaurusFile

The global thesaurus file corresponds to the Neutral language with LCID 0. This value can be changed by administrators only.

How Queries Use Thesaurus Files

A thesaurus query uses both a language-specific thesaurus and the global thesaurus. First, the query looks up the language-specific file and loads it for processing (unless it is already loaded). The query is expanded to include the language-specific synonyms specified by the expansion set and replacement set rules in the thesaurus file. These steps are then repeated for the global thesaurus. However, if a term is already part of a match in the language-specific thesaurus file, the term is ineligible for matching in the global thesaurus.
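
The precedence rule above can be illustrated with a pair of files. This is a sketch under stated assumptions: the language-specific file name tsENU.xml follows the naming format described earlier (the "ENU" abbreviation is assumed here), the entries "Win2K", "run", and "jog" are illustrative, and the XML comments are annotations for the reader rather than required content.

```xml
<!-- Language-specific file (assumed name: tsENU.xml) -->
<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <replacement>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
    </thesaurus>
</XML>

<!-- Global file (tsGlobal.xml) -->
<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <!-- Not applied to "W2K" when the language-specific file already matched it -->
        <expansion>
            <sub>W2K</sub>
            <sub>Win2K</sub>
        </expansion>
        <!-- Still eligible: "jog" has no match in the language-specific file -->
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
</XML>
```

With these two files, a thesaurus query for "W2K" would be handled by the language-specific replacement set only, while a query for "jog" would fall through to the global expansion set.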

Understanding the Structure of a Thesaurus File

Each thesaurus file defines an XML container whose ID is Microsoft Search Thesaurus, and a comment, <!-- … -->, that contains a sample thesaurus. The thesaurus is defined in a <thesaurus> element that contains samples of the child elements that define the diacritics setting, expansion sets, and replacement sets, as follows:

  • XML Structure of the Diacritical Setting

    The diacritics setting of a thesaurus is specified in a single <diacritics_sensitive> element. This element contains an integer value that controls accent sensitivity, as follows:

    Diacritics setting    Value    XML
    Accent insensitive    0        <diacritics_sensitive>0</diacritics_sensitive>
    Accent sensitive      1        <diacritics_sensitive>1</diacritics_sensitive>

    Note

    This setting can only be applied one time in the file, and it applies to all search patterns in the file. This setting cannot be specified for individual patterns.

  • XML Structure of an Expansion Set

    Each expansion set is enclosed within an <expansion> element. Within this element, you specify one or more substitutions, each in its own <sub> element. The substitutions in an expansion set are treated as synonyms of each other.

    For example, you can edit the expansion section to treat the substitutions "writer", "author", and "journalist" as synonyms. Full-text search queries that contain matches in one substitution are expanded to include all other substitutions specified in the expansion set. Therefore, in the preceding example, when you issue a FORMSOF THESAURUS or a FREETEXT query for the word "author", full-text search also returns search results containing the words "writer" and "journalist".

    This is what the expansion set section would look like for the above example:

     <expansion>
             <sub>writer</sub>
             <sub>author</sub>
             <sub>journalist</sub>
     </expansion>
    
  • XML Structure of a Replacement Set

    Each replacement set is enclosed within a <replacement> element. Within this element, you can specify one or more patterns, each in a <pat> element, and zero or more substitutions, each in a <sub> element (one per synonym). Each pattern is replaced by the substitution set. Patterns and substitutions can contain a single word or a sequence of words. If no substitution is specified for a pattern, the pattern is removed from the user query.

    For example, suppose you want queries for "W2K", the pattern, to be replaced by "Windows 2000" or "XP", the substitutions. If you run a full-text query for "W2K", full-text search returns only search results containing "Windows 2000" or "XP". It does not return results containing "W2K". This is because the pattern "W2K" has been replaced by the substitutions "Windows 2000" and "XP".

    This is what the replacement set section would look like for the above example:

     <replacement>
             <pat>W2K</pat>
             <sub>Windows 2000</sub>
             <sub>XP</sub>
     </replacement>
    

    If you have two replacement sets with similar patterns being matched, the longer of the two takes precedence. For example, if you run a FORMSOF THESAURUS query for "Internet Explorer online community" and you have the following replacement sets, the "Internet Explorer" replacement set takes precedence over the "Internet" replacement set. The query will therefore be processed as "IE online community" or "IE 5 online community".

    <replacement>
             <pat>Internet</pat>
             <sub>intranet</sub>
    </replacement>
    

    and

    <replacement>
             <pat>Internet Explorer</pat>
             <sub>IE</sub>
             <sub>IE 5</sub>
    </replacement>
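
One case described under "XML Structure of a Replacement Set" but not shown above is a replacement set with no substitutions. A minimal sketch follows, assuming the <sub> element is simply omitted; the pattern "please" is a hypothetical example, and the XML comment is an annotation for the reader.

```xml
<replacement>
    <!-- Hypothetical pattern: with no <sub> element, it is removed from the user query -->
    <pat>please</pat>
</replacement>
```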
    

Working with Thesaurus Files

To edit a thesaurus file

To load an updated thesaurus file

To view the tokenization result of a word breaker, thesaurus, and stoplist combination