Textual Searches on Database Data Using Microsoft SQL Server 7.0 

 

Microsoft Corporation

January 1999

Summary: This article describes the concept of full-text search, as well as the design and architecture of the SQL Server 7.0 full-text search system. This article also illustrated how the SQL Enterprise Manager facilitated administration with stored procedures and graphical user interfaces. (19 printed pages)

Contents

Introduction
Full-Text Search Concepts
Transact-SQL Extensions for Full-Text Searches
Textual Searches against File-System Data
Component Architecture
Administration
Conclusion

Introduction

Microsoft® SQL Server™ version 7.0 introduces facilities that support textual queries on data residing in SQL Server as well as textual queries on data in the file system.

This document describes support for textual queries against data within SQL Server tables. The full-text search concepts will be introduced, followed by the form that a full-text search would take and the type of information that can be retrieved by means of such queries. The internal design and architecture of the full-text search system will be presented, and then how this system can be administered by means of stored procedures or graphical user interfaces (GUIs) through SQL Server Enterprise Manager will be described.

This document should provide a good overview of the capability of SQL Server 7.0 full-text support and a good grasp of how the various subcomponents interact to provide this support.

A large portion of digitally stored information is in the form of unstructured data, primarily text. While the bulk of this data is stored in the file system, some organizations have begun to manage it by storing it in relational databases in character-based columns such as varchar and text. This means that relational database users need a mechanism to effectively retrieve textual data from the database. Traditional relational database management systems (RDBMSs), such as Microsoft SQL Server version 6.5, were not designed for efficient full-text retrieval. For example, while SQL Server 6.5 has some capabilities for retrieving text based on pattern matching, it could not search for words and phrases in close proximity to one another.

There are two major types of textual searches:

  • Property

    This search technology first applies filters to documents in order to extract properties such as author, subject, type, word count, printed page count, and time last written, and then issues searches against those properties.

  • Full-text

    This search technology first creates indexes of all non noise words in the documents, and then uses these indexes to support linguistic searches and proximity searches.

The lack of integrated textual-search facilities in relational databases has forced customers to use third-party products. These solutions usually involve pulling data out of the database through bridges or gateways and storing the data as files in the file system so that full-text indexing can be applied. This is not a seamless way for a user to combine a full-text search with a regular structured relational query.

Some relational database products offer customers relational and full-text search conditions seamlessly integrated into the same query. A demonstration of how such a query might be specified follows:

Suppose that the contents of a set of plain-text documents reside in the DocText column of the doc_collection table, and that the table also contains the StorName, Size, and DocAuthor columns. It should be possible to issue the following query:

SELECT Q.StorName, Q.Size, Q.DocAuthor, W.Citizenship 
FROM doc_collection as Q,
     writers as W
WHERE CONTAINS(DocText, ' "SQL Server" NEAR text') 
  AND Q.DocAuthor = W.writer_name

This query is issued to obtain the names, authors, and sizes of all documents, where the document contains the phrase "SQL Server" in close proximity to the word "text". This information is joined with the writers table to obtain the author's citizenship.

Over the past year, Microsoft has been engaged in a project that will make it possible to issue such queries. The objective was to introduce the Microsoft text search technology as part of SQL Server 7.0 so that users could issue entry-level full-text searches against plain-text data in relational database tables. The syntax is a natural extension to the SQL language.

Existing technologies have been leveraged to provide full-text searches against SQL Server data. The Microsoft information retrieval technologies have been around for some time and have been shipped with Microsoft Index Server version 2.0 and Microsoft Site Server version 3.0. Various components have been combined within these technologies, and these have been integrated into SQL Server 7.0 to provide relational database customers with full-text retrieval support.

Two additional technologies outside of SQL Server also have been integrated:

  • The Microsoft Search service, a full-text indexing and search service that is called both index engine and search engine in the context of this document; within the context of SQL Server, this is called full-text search
  • The parser component of the OLE DB Provider for Index Server 2.0 that accepts full-text SQL extensions and maps them into a form that can be processed by the search engine

This document has three purposes:

  • To introduce the SQL extensions in support of full-text searches on database data
  • To show how the various technologies have been brought together to allow full-text searches on database data
  • To introduce the new SQL Server administrative facilities in support of full-text searches on database data

Full-Text Search Concepts

There are four major aspects of supporting full-text retrieval of plain-text data on a database:

  • Managing the definition of the tables and columns that are registered for full-text searches

  • Indexing the data in registered columns

    The indexing process scans the character streams, determines the word boundaries (this is called word breaking), removes all noise words (this also is called stop words), and then populates a full-text index with the remaining words.

  • Issuing queries against registered columns for populated full-text indexes

  • Ensuring that subsequent changes to the data in registered columns gets propagated to the index engine to keep the full-text indexes synchronized

The underlying design principle for the indexing, querying, and synchronizing processes is the presence of a full-text unique key column (or single-column primary key) on all tables registered for full-text searches. The full-text index contains an entry for the non noise words in each row together with the value of the key column for each row.

When processing a full-text search, the search engine returns to SQL Server the key values of the rows that match the search criteria. The following example illustrates what is happening:

Suppose you have a SciFi table with the following columns, where the Book_No column is the primary key column, as shown in Table 1 below:

Table 1. SciFi table

Book_No Writer Title
A025 Asimov Foundation's Edge
A027 Asimov Foundation and Empire
C011 Clarke Childhood's End
V109 Verne Mysterious Island

Next, suppose you have a full-text retrieval query where you want to find the book titles having the word "Foundation" in them. When the SQL Server relational engine encounters a full-text retrieval predicate, it calls the full-text search component to retrieve the values of Book_No that satisfy the text retrieval filter condition. In this case, the values A025 and A027 are returned. The relational engine then uses this information, together with other information under its direct control, to respond to the query.

Unlike classic relational database indexes, traditional full-text indexes are not modified instantly when values in full-text registered columns are updated, when rows are added to full-text registered tables, or when rows are deleted from full-text registered tables. Rather, full-text indexes usually are repopulated asynchronously. There are two reasons for this:

  • It typically takes significantly more time to update a full-text index than a classic index
  • Full-text searches usually are fuzzy by nature and so do not need to be as precise as classic searches

During repopulation, the unique key column values are passed to the index engine to identify those items that need to be reindexed. For example, if the title associated with V109 gets changed to "Mystery Island," then the index should be modified to reflect this new value.

The full-text administration process starts by designating a table and its columns of interest for full-text search. Either GUIs and wizards or built-in stored procedures are used first to register tables and columns as eligible for full-text search. Then, a separate request (again by means of a GUI and wizards or stored procedures) is issued to populate the full-text indexes.

The result is that the underlying index engine gets invoked and asynchronous index population begins. Full-text indexing tracks which significant words are used and where they are located. For example, a full-text index might indicate that the word "Microsoft" is found at word number 423 and word number 982 in the Abstract column of the DevTools table for the row associated with a ProductID of 6. This index structure supports an efficient search for all items containing indexed words as well as advanced search operations, such as phrase searches and proximity searches. (An example of a phrase search is looking for "white elephant," where "white" is followed by "elephant." An example of a proximity search is looking for "big" and "house" where "big" occurs near "house.")

To prevent the full-text index from becoming bloated, noise words such as "a," "and," and "the" are ignored. Noise word lists for many languages are available in the directory \Mssql\Ftdata\Sqlserver\Config. And the set of supported languages is growing. The choice of a particular noise word list is based on the language setting of the database server. These noise word lists should be sufficient for most operations, but they can be modified by using a regular text editor. For example, a computer company can add the word "computer" to its noise word list.

Extensions to Microsoft Transact-SQL were made to allow users to pose full-text queries. The syntax of the CONTAINS and FREETEXT predicates, already supported for full-text search in both the Microsoft OLE DB Provider for Index Server 2.0 and the Microsoft OLE DB Provider for Site Server 3.0 Search, were used.

The CONTAINS predicate is used to search for:

  • A word or phrase
  • The prefix of a word or phrase
  • A word or phrase that is near another
  • A word that is an inflectional form of another (for example, "drive" is the inflectional stem of "drives," "drove," "driving," and "driven")
  • A set of words or phrases, each of which is assigned a different weighting

The FREETEXT predicate is a basic form of natural language query. Any text, including words, phrases, or sentences, can be specified in the query. The search engine matches values that reflect the meaning, rather than the exact wording, of the query.

The SQL Server relational engine recognizes the CONTAINS and FREETEXT predicates and performs some minimal syntax and semantic checking, such as ensuring that the column referenced in the predicate has been registered for full-text searches. During query execution, a full-text predicate and other relevant information are passed to the full-text search component. After further syntax and semantic validation, the search engine is invoked and returns the set of unique key values identifying those rows in the table that satisfy the full-text search condition.

There currently are two mechanisms available to an administrator for keeping a full-text index synchronized with changes to the data in its table:

  • Scheduled repopulation
  • On-demand repopulation

Both GUIs and stored procedures are provided for repopulation tasks. If a table has a column with a timestamp data type, it is possible to request an incremental repopulation where only changes to the table since the last population participate in the repopulation. If a table does not have a timestamp column, then only a complete repopulation is possible.

An alternative to automatic update of full-text indexes based on log activity is a candidate for a future release of SQL Server.

Transact-SQL Extensions for Full-Text Searches

A point worth repeating is that these extensions are consistent with the SQL language supported for full-text searches in Index Server 2.0. Furthermore, the SQL support for full-text searches follows the SQL-3 functional methodology for full-text syntax extensions.

The primary extension to Transact-SQL consists of the CONTAINS and FREETEXT predicates. These predicates are used to find column values that match special full-text search criteria.

In addition, the new CONTAINSTABLE() and FREETEXTTABLE() rowset-valued functions are supported. These functions can be specified (in the FROM clause) to return a two-column table, where one column uniquely identifies the rows that match the specified full-text search criteria and the other contains rank values showing the degree to which the rows match the criteria.

To be consistent with similar features in other products and to make the new predicates more extensible, functional notation has been chosen. The high-level syntax is:

 ?--CONTAINS--(column_ref[s]--,ft_search_condition)--?
   +FREETEXT+ 

In Microsoft SQL Server 7.0, the language always is implied by the locale of the database holding the data. The flexibility of the functional style of these two predicates allows easy, upward-compatible future extension for a third parameter to designate the language to be used for the query.

These predicates can be used with the character family of data types: char, varchar, text, nchar, nvarchar, and ntext. Often, document formats stored in columns defined with the binary family of data types cannot be indexed or searched. Such support will be considered in future releases of SQL Server.

The CONTAINS Predicate

The CONTAINS predicate is used to determine whether or not values in full-text registered columns contain certain words and phrases. Currently, these predicates can reference base tables only.

The CONTAINS predicate syntax is as follows in figure 1:

Figure 1. CONTAINS predicate syntax

Here are the definitions of the syntax terminology.

  • column_ref or *

    The column or columns to be searched.

  • column_ref

    A specific column that is full-text registered.

  • *

    All columns in the table that are full-text registered.

  • AND, OR, and AND NOT

    The Boolean operators used to join, or combine, terms.

simple_term

The simple_term is used to match the exact word or phrase being searched for.

The simple_term syntax is as follows in figure 2, where word refers to one or more characters without spaces or punctuation, and phrase refers to multiple words with spaces in between. Asian languages can have phrases made up of multiple words without spaces in between.

Figure 2. simple_term syntax

In keeping with the standard for full-text products, the search function is always case insensitive.

Here are some examples of simple terms used in the context of the CONTAINS predicate in a WHERE clause.

WHERE CONTAINS( context_col, 'hockey' )
WHERE CONTAINS( context_col, ' "ice hockey" ')

Suppose you have one row with a context_col value of "This is a dissertation on the use of ice cream sandwiches as hockey pucks," and another row with the value "Dissertation on new ways of splitting the atom." Since "this," "is," "a," and so on are noise words, they are not stored in the full-text index. Therefore, a query with the CONTAINS predicate:

   CONTAINS (context_col, '"this is a dissertation"')
is the same as this query:
   CONTAINS (context_col, 'dissertation')

Both rows will be returned as hits because in the first query, the noise words are removed before processing the query.

Combining Terms

As with other SQL search conditions, more complex conditions can be specified by linking individual operands with Boolean operators. In this case, the operands are any of the types of terms being discussed. Except for the restrictions that the OR NOT combination is not supported and that NOT cannot be specified before the first term, the rules are exactly the same as those used to combine individual predicates to form search conditions. For instance, parentheses may be used to change the default priority order in which the operators are applied.

Here are some examples of simple terms being combined within a CONTAINS predicate in a WHERE clause:

WHERE CONTAINS( context_col, 'hockey OR curling' )
WHERE CONTAINS( context_col, 'hockey AND NOT field')
WHERE CONTAINS( context_col, 
                ' ("ice hockey" OR curling) AND NOT 
                  Canada ' )

prefix_term

The prefix_term is used to match words or phrases that begin with the specified text.

The prefix_term syntax is as follows in figure 3:

Figure 3. prefix_term syntax

A prefix term consists of a simple term appended with an asterisk (*) to activate prefix matching on the word or phrase. All text that starts with the material before the * is matched. The wildcard symbol (*) in this case is similar to the % symbol in the LIKE predicate in that it matches zero, one or more characters of the root words in a word or phrase. In a phrase, each word within the phrase is considered to be a prefix; for example, the term "local bus *" matches "locality busy," "local bush," and "locale bust."

Here are some examples of prefix terms used in the context of the CONTAINS predicate in a WHERE clause.

WHERE CONTAINS( context_col, ' "atom*" ' )
This matches context_col values that contain the word "atom," "atomic," "atomism," "atomy," and so on.
WHERE CONTAINS( abstract, ' "wine*" OR "vine*" ') 

This matches abstract values that contain the word "wine," or "vine," or, alternatively, words such as "winery," "wines," "vineyard," or "vinegar."

proximity_term

This is used when the words or phrases being searched for must be in close proximity to one another.

The proximity_term syntax is as follows in figure 4:

Figure 4. proximity_term syntax

The proximity term is similar to the AND operator in that more than one word or phrase must exist in the column being searched. It differs from AND because the relevance of the match increases as the words appear closer together.

The syntax is designed to be extensible for possible future support for specification of units of proximity such as words, sentences, paragraphs, chapters, and so on.

NEAR(), NEAR, and ~ share the same meaning: the first word or the phrase is close to the second word or phrase. "Close" is a purposefully vague term that can mean "within 50 words," but the algorithm is complicated. While words within the same sentence are one word distance apart, larger distances are assigned between units such as sentences, paragraphs, and chapters. Even if words or phrases are very far apart, the query is still considered to be satisfied; the row just has a low (zero) rank value. However, if the contains_condition consists only of NEAR proximity terms, then SQL Server does not return rows with a rank value of zero.

It is possible to chain-code the proximity matching. For example, "a ~ b ~ c" means a should be near b, which should be near c. Because of the fuzzy nature of full-text searches, it is often desirable to see the rank values. The CONTAINSTABLE() rowset-valued function can be used to execute queries that return a rank value for each row.

Here are some examples of proximity terms used in the context of the CONTAINS predicate in a WHERE clause.

WHERE CONTAINS( context_col, ' hockey ~ player ' )
This matches context_col values that contain the word "hockey" in close proximity to the word "player."
WHERE CONTAINS( context_col, ' hockey ~ "play*" ') 
This matches context_col values that contain the word "hockey" in close proximity to a word that starts with "play."
WHERE CONTAINS( context_col, '  "great*" 
                              ~ "Mike Nash" ') 

This matches context_col values that contain a word starting with "great" in close proximity to the phrase "Mike Nash."

generation_term

The generation_term is used when the words being searched for need to be expanded to include the variants of the original word.

The generation_term syntax is as follows in figure 5:

Figure 5. generation_term syntax

The INFLECTIONAL keyword means that plural and singular forms of nouns or the various tenses of verbs will be matched. A single term will not match both exclusive noun and exclusive verb forms. The syntax is designed to be extensible enough to handle other linguistically generated forms, such as derivational, soundex, and thesaurus.

Here is an example of a generation term used in the context of the CONTAINS predicate in a WHERE clause.

WHERE CONTAINS(curriculum_vite, 
               ' FORMSOF (INFLECTIONAL, skate) ' )

This matches curriculum_vite values that contain words such as "skate," "skates," "skated," and "skating."

weighted_term

The weighted_term is used for queries that match a list of words and phrases, each optionally given its own weighting. Matching values must match only one element in the list.

The weighted_term syntax is as follows in figure 6, where n.nnn: represents a decimal constant from zero through one:

Figure 6. weighted_term

A row is returned if there is a match on any one of the ISABOUT elements.

Each component in the vector can be optionally weighted. The assigned weight forces a different measurement of the rank value that is assigned to each row that matches the query.

Here are some examples of weighted terms used in the context of the CONTAINS predicate in a WHERE clause.

WHERE CONTAINS( article, 
                ' ISABOUT(hockey, puck, goalie) ' )

This matches article values that contain any of the words "hockey," "puck," or "goalie." The better matches will contain more than one of the words.

WHERE CONTAINS( article, 
         'ISABOUT("Canadian ice hockey" WEIGHT(1.0),
                  "ice hockey" WEIGHT(.5),
                  hockey WEIGHT(.2) ) 
         ' )

This matches article values that may have information about Canadian ice hockey, with higher rank values assigned to articles that have more words from the phrase.

The FREETEXT Predicate

The FREETEXT predicate is used to determine whether or not values in full-text registered columns reflect the meaning, rather than the exact words, specified in the predicate.

The FREETEXT predicate syntax is as follows in figure 7:

Figure 7. The FREETEXT predicate syntax

This is a basic form of natural language query, where the index engine internally breaks the freetext string into a number of search terms, generates the stemmed form of the words, assigns heuristic weighting to each term, and then finds the matches.

Here is an example of a FREETEXT predicate used in a WHERE clause.

WHERE FREETEXT (articles, ' Who have been the most  
    valuable players for the Montreal Canadien? ' ) 

Combining Predicates and Use of Predicates

Because CONTAINS and FREETEXT are SQL predicates, they can be used anywhere that SQL predicates are supported. In particular, they can be combined with each other and with other predicates, such as equality, LIKE, and BETWEEN, to specify extensive search conditions.

The WHERE clause in the following query uses both a CONTAINS predicate and a comparison predicate. It obtains the title and publication year of all the books in the titles table in the pubs database, where the book costs less than $20.00 and text in the notes column indicates that the book is about ice hockey.

SELECT title, DatePart(year, pubdate) 
  FROM pubs
  WHERE price < 20.00
    AND CONTAINS (notes, ' "ice hockey" ') 

The following query uses a CONTAINS predicate within a subquery. It obtains the titles of all the books in the titles table for the publisher who is located close to the flying saucer in Moonbeam, Ontario. This information about the publisher is known to exist in the pr_info column in the pub_info table, and it is also known that there is only one such publisher.

SELECT T.title, P.pub_name 
  FROM publishers P,
       Titles T
  WHERE P.pub_id = T.pub_id
    AND P.pub_id = (SELECT pub_id
                      FROM pub_info
                      WHERE CONTAINS 
                              (pr_info, 
                               ' moonbeam AND
                                 ontario AND
                                 "flying saucer" 
                               ') ) 

The CONTAINSTABLE() Rowset-Valued Function

The CONTAINSTABLE() function is used to issue contains-type full-text queries that return relevance rankings for each row.

The CONTAINSTABLE syntax is as follows in figure 8, where table_ref refers to a full-text registered table, column_ref: refers to a specific full-text registered column, or * refers to all columns in table_ref that are full-text registered.

Figure 8. CONTAINSTABLE() function

contains_condition

is the same as described for the CONTAINS predicate

Although both the CONTAINS predicate and CONTAINSTABLE() function are used for contains-type full-text queries, and the SQL language used to specify the full-text search condition is the same in both, there are major differences in the way these are used:

  • CONTAINS returns a true or false value, and so typically is specified in the WHERE clause of a SELECT statement.

    CONTAINTABLE() returns a table of zero, one, or more rows, and so always must be specified in the FROM clause.

  • CONTAINS can be used only to specify selection criteria that SQL Server uses to determine the membership of the result set.

    CONTAINSTABLE() is also used to specify selection criteria. The table returned has a column named key that contains full-text key values. Each full-text registered table has a column that has values guaranteed to be unique, and the values returned in the key column are the full-text key values of the rows that match the selection criteria specified in the contains_condition. Furthermore, the table produced by the CONTAINSTABLE() function has a column named rank that contains values from zero to 1,000 that can be used to rank the returned rows according to how well they meet the selection criteria.

Queries that use the CONTAINSTABLE() function are more complex than those using the CONTAINS predicate because it is necessary to explicitly join qualifying rows returned by CONTAINSTABLE() with the rows in table_ref.

For example, the contents of some documents reside in the DocText column of the doc_collection table, and the table also contains the StorName, Size, and DocAuthor columns. The unique key column for the table is named DocNo. You want the rows in the result set to be ordered so that those with the highest rank value are returned first.

SELECT Q.StorName, Q.Size, Q.DocAuthor, W.Citizenship 
FROM doc_collection as Q,
     writers as W,
     ContainsTable(doc_collection, DocText,
                   ' "SQL Server" NEAR() text'
                  ) AS K
WHERE Q.DocAuthor = W.writer_name
  AND K.[KEY] = Q.DocNo
ORDER BY K.RANK DESC

To simplify the use of CONTAINSTABLE() for queries that involve only one table and no grouping, use the following fixed template:

SELECT select_list ,KEY_TBL.RANK
FROM table_ref AS FT_TBL
CONTAINSTABLE (     table_ref
                ,   { column_ref | * }
                , ' contains_condition ' )
) AS KEY_TBL
WHERE FT_TBL.key_column = KEY_TBL.[KEY]
      AND predicate  ...
ORDER BY KEY_TBL.RANK DESC

The following example uses the previous template. In this example, the wording on the plaques in the Hockey Hall of Fame reside in the PlaqueWording column of HockeyHall table, and the table also contains the PlayerName, StartYear, and LastYear columns. The unique key column for the table is named PlaqueNo. You want to return the PlayerName, PlaqueNo, and rank values for players who might have played for the Kenora (also called Rat Portage) teams in the early 1900s. The higher ranking rows should be returned first.

SELECT PlayerName, PlaqueNo, KEY_TBL.RANK
FROM HockeyHall AS FT_TBL
     CONTAINSTABLE(HockeyHall, PlaqueWording,
                   ' Kenora OR "Rat Portage" '
                  ) AS KEY_TBL
WHERE FT_TBL.PlaqueNO = KEY_TBL.[KEY]
  AND StartYear < 1915 AND EndYear > 1899 
ORDER BY KEY_TBL.RANK DESC

The FREETEXTTABLE() Rowset-Valued Function

The FREETEXTTABLE() function is used to issue freetext-type full-text queries that return relevance ranking for each row.

The FREETEXTTABLE() syntax is as follows in figure 9:

Figure 9. FREETEXTTABLE() function

This function is used in the same manner as a CONTAINSTABLE() function, and the search condition is the same as specified by a FREETEXT predicate.

Textual Searches against File-System Data

For a fuller discussion, see: "Textual Searches on File Data Using Microsoft SQL Server 7.0".

It is important to understand that both facilities exist. It also is worth noting that the ability to search against file-system data only already exists in Index Server 2.0. This capability is further enhanced by the SQL Server 7.0 Distributed Query Processor, which, in conjunction with the Microsoft OLE DB Provider for Index Server 2.0, can be used to issue both property and full-text searches against file-system data and to join the results with data in the database.

By recasting the first example in this document, we will introduce searches against the file system. The following query selects the names, sizes, and authors of all Microsoft Word files on drive D, where the document contains the phrase "SQL Server" in close proximity to the word "text" and joins this result with the writers table to obtain the author's citizenship.

SELECT Q.FileName, Q.Size, Q.DocAuthor, W.Citizenship 
FROM OpenQuery(MyLinkedServer, 
               'SELECT FileName, Size, DocAuthor
                from SCOPE('' "D:\" '')
                WHERE CONTAINS(''"SQL Server"
                                 NEAR() text'')
                AND FileName LIKE ''%.doc%'' '
              ) as Q,
     writers as W
WHERE Q.DocAuthor = W.writer_name

The location of the data must be known to the person writing the queries. The SQL standards organization is developing a new datalink data type, which will allow seamless queries that access data both inside and outside the database. This emerging definition offers some interesting possibilities.

Component Architecture

There are two aspects to the component architecture: indexing components and query components.

Indexing Components

The indexing components manage the initial population and subsequent updating of the full-text indexes.

Before presenting the steps involved in the indexing process, various subcomponents will be introduced, and then the flow among the subcomponents will be described.

Enterprise Manager User Interface

This is the GUI for the full-text indexing administration utility. It takes the form of extensions to the current Microsoft SQL Server Enterprise Manager property sheets, plus a new wizard. These have been designed to assist first-time and infrequent administrators of full-text indexed tables. The GUI also is appropriate for frequent users. It allows a user to select the desired tables for full-text indexing and walks the user through the various steps needed to set this up.

One option is to schedule regular refreshing of full-text indexes; this uses the scheduling facilities that already exist in SQL Server. The GUI also can be used to display properties of the full-text indexed tables. This GUI uses a set of new full-text administration system stored procedures, a set of full-text related properties available through the SQL Server property functions, as well as the existing SQL Server Agent job scheduler.

Full-Text Administration System Stored Procedures

This set of system stored procedures is used to:

  • Set up full-text catalogs (a full-text catalog is an entity holding a collection of full-text indexes)
  • Register a table and the selected columns within the table for full-text search
  • Request the population of the indexes within a full-text catalog
  • Undo any of the previous

The setup and configuration capabilities of full-text indexing reside in the stored procedures, not the GUI. It is possible to use these stored procedures directly rather than using SQL Server Enterprise Manager, and experienced administrators may choose to do this. Calls to the stored procedures also may be embedded in scripts and other stored procedures.

Scheduling Stored Procedures

The set of stored procedures (such as sp_add_jobschedule) that is used for scheduling regular jobs also can be used to schedule refreshes of full-text catalogs. See figure 10 below:

Figure 10. Scheduling stored procedures

Full-Text Index Additions to System Tables

The sysdatabases system catalog has been extended to provide the new IsFulltextEnabled property, which is available through the DATABASEPROPERTY() function.

A new system table, sysfulltextcatalogs, is now present for each database. This table holds metadata about the full-text catalogs that are linked to the database. A given full-text catalog is linked to only one database.

The sysobjects, sysindexes, and syscolumns tables that reside in each database will be augmented with information about the tables and the associated columns in that database that have been full-text indexed. The information is:

  • The ftcatid column of user table rows in sysobjects, which points to rows in sysfulltextcatalogs
  • The TableHasActiveFulltextIndex property, which is available through the OBJECTPROPERTY() function
  • The TableFulltextKeyColumn property, which is the column ID of the full-text unique key column and is available through the OBJECTPROPERTY() function
  • The TableFulltextCatalogId property, which is the full-text catalog ID associated with the table and is available through the OBJECTPROPERTY() function
  • The IsFulltextKey property, which is available through the INDEXPROPERTY() function
  • The IsFulltextIndexed property, which is available through the COLUMNPROPERTY() function

Microsoft Search Service

This is a Windows NT service that has two roles:

  • Indexing support accepts requests to populate the full-text index of a given table
  • Querying support processes full-text searches

Microsoft Search Service operates under the context of the local system account. This service must be run on the same computer as SQL Server.

Indexing Support

The information about populating full-text indexes is submitted in the form of a population start seed value, which uniquely identifies both the database and the table that needs to be full-text indexed. When the service is ready to handle the population, it invokes the SQL Server Handler driver.

SQL Server Handler

This subcomponent is a driver that plugs into indexing support and is specially coded to handle SQL Server data stores.

SQL Server Handler always runs in the same process as Microsoft Search Service.

Index Engine

This subcomponent presents indexable units of text (in the form of character strings), each with an identifying key. It scans through character strings, determines the word boundaries, removes all noise words, and then populates a full-text index with the remaining words.

Full-Text Catalogs

This is where full-text indexes reside. This is a Windows NT file-system directory that is accessible only by Windows NT Administrator and Microsoft Search Service. The full-text indexes are organized into full-text catalogs, which are referenced by friendly names. Typically, the full-text index data for an entire database is placed into a single full-text catalog. However, administrators can partition the full-text index data for a database across more than one full-text catalog. This is useful if one or more of the tables being full-text indexed contains a large number of rows. A full-text catalog is limited to holding index data for 231 to 64K rows.

In SQL Server 7.0, there are no facilities to coordinate backup and recovery relational database data in conjunction with the relevant indexes in the full-text catalogs (however there are facilities to resynchronize them). Coordinated backup and recovery is a candidate for a future release of SQL Server.

Full-Text Indexing Administration

The following steps are started by an administrator using GUIs or stored procedures (either on demand or as scheduled).

  1. First, enable the database for full-text search and then identify the tables and columns that are to be registered for full-text search. This information is stored in the SQL Server system tables.

  2. When a table is activated for full-text processing, a population start seed for that table is sent to indexing support.

  3. Next, request the initial population of the full-text index for a table. Actually, the granularity of population is a full-text catalog, so if more than one table has been linked to a full-text catalog, the result is the full population of the full-text indexes for all tables linked to that catalog.

  4. The knowledge of the tables and rows that require indexing resides in SQL Server, so when indexing support receives a population request for a full-text catalog, it calls back into SQL Server to obtain data from all the columns in the table that have been marked for indexing. When this data arrives, it is passed to the index engine where it is broken into words. Noise words are removed and the remaining words are stored in the index.

  5. Full-text indexes are kept current by using a GUI that sets up a schedule for periodic refreshes of a full-text catalog. This GUI uses stored procedures from the SQL Server job scheduler. It also is possible to request a refresh at any time, either by means of a GUI or by direct use of a stored procedure.

    If a table has a row-versioning (timestamp) column, repopulation can be handled more efficiently. At the time the population start seed for a table is constructed, the largest row-versioning value in the database is remembered. When an incremental population is requested, the SQL Server handler connects to the database and requests only rows where the row-versioning value is greater than the remembered value.

    During an incremental population, if there is a full-text indexed table in a catalog that does not have a row-versioning column, then that table will be completely repopulated.

    There are two cases where a complete repopulation is performed, even though there is a row-versioning column on the table. These are:

    • In tables when the schema has changed
    • In tables activated since the last population

    After populating tables with a row-versioning column, the remembered row-versioning value is updated.

    Since incremental repopulation on relatively static tables with timestamp columns can be completed faster than a complete repopulation, users can schedule repopulation on a more frequent basis. For a given database, users should take care not to mix index data from tables with timestamp columns with tables in the same full-text catalog, because these two groups of tables usually should be on separate repopulation schedules.

  6. Additional tables and columns will be registered for full-text search, and full-text indexes will be generated for them. The full-text search capability may be removed from some tables and columns. Some full-text catalogs may be dropped. This step may be repeated several times.

Full-Text Query Component Architecture

The query components accept a full-text predicate or rowset-valued function from SQL Server, transform parts of the predicate into an internal format, and send it to Microsoft Search Service, which returns the matches in a rowset. The rowset is then sent back to SQL Server. SQL Server uses this information to create the result set that is then returned to the submitter of the query.

The SQL Server Relational Engine accepts the CONTAINS and FREETEXT predicates as well as the CONTAINSTABLE() and FREETEXTTABLE() rowset-valued functions. During parse time, this code checks for conditions such as attempting to query a column that has not been registered for full-text search. If valid, then at run time, the ft_search_condition and context information is sent to the full-text provider. Eventually, the full-text provider returns a rowset to SQL Server, which is used in any joins (specified or implied) in the original query.

The Full-Text Provider parses and validates ft_search_condition, constructs the appropriate internal representation of the full-text search condition, and then passes it to the search engine. The result is returned to the relational engine by means of a rowset of rows that satisfy ft_search_condition. The handling of this rowset is conceptually similar to the code used in support of the OPENROWSET() and OPENQUERY() rowset-valued functions.

Search Engine

This subcomponent processes full-text search queries. It determines which entries in the index meet the selection criteria. For each entry that meets the selection criteria, the value of the unique key column and a ranking value are returned.

The Full-Text Query Process

  1. A query that uses one of the full-text constructs (that is CONTAINS, FREETEXT, CONTAINSTABLE(), or FREETEXTTABLE) is submitted to the SQL Relational Engine.
  2. Queries containing either the CONTAINS or FREETEXT predicate are rewritten so that the rowset returned from the Full-Text Provider later will be automatically joined to the table that the predicate is acting upon. This rewrite is the mechanism used to ensure that these predicates are a seamless extension to SQL Server. By specifying CONTAINS and FREETEXT, you do not have to concern yourself with or even know about the details of the underlying invocation of the Microsoft Search Service.
  3. The Full-Text Provider is invoked, passing the following information:
    • The ft_search_condition

    • The friendly name of the full-text catalog where the full-text index of a table resides

    • The locale ID to be used for language (for example, word breaking)

    • The identities of the database, table, and column

      If the query is comprised of more than one full-text construct, the full-text provider is invoked separately for each construct.

      The SQL Relational Engine does not look into the contents of the ft_search_condition. Instead, this is passed along to the full-text provider, which checks it for validity and then creates the appropriate internal representation of the full-text search condition.

  4. The command is passed to Querying Support.
  5. Querying Support returns a rowset that contains the unique key column values for the rows that match the full-text search criteria. A rank value also is returned for each row.
  6. The rowset is passed to the SQL Relational Engine. If processing either a CONTAINSTABLE() or FREETEXTTABLE() function, RANK values are returned; otherwise, the rank value is filtered out.
  7. The rowset values are plugged into the query with values obtained from the relational database, and the result set is returned to the user.

Administration

In the current implementation, full-text indexes are different from classic SQL Server indexes. These differences give rise to the need for the administrative tasks outlined here. Table 2 below contains a summary of the main differences:

Table 2. Classic SQL indexes vs. full-text indexes

Classic SQL Indexes Full-Text Indexes
Stored in and under the control of the database Stored in the file system, but administered via the database
Several indexes per table Only one full-text index definition per table
Automatically updated when the data upon which they are based is inserted, updated, or deleted Population must be requested (either as scheduled or on demand)
Not grouped Grouped
Created and dropped using SQL statements Created, managed, and dropped using stored procedures (either directly of by means of GUIs)

Full-Text Administration is carried out at several different levels:

  • Certain server-wide properties can be set, such as those required for Microsoft Search Service to initialize and prepare for index population
  • A database must be enabled to use full-text search (metadata for one or more full-text catalogs can be created and dropped in an enabled database)
  • A full-text catalog must be populated using administrative facilities
  • A table must be registered as supporting full-text searches, at which time metadata is created for the full-text index of that table
  • A registered table must be activated before it can participate in the next full-text catalog population
  • Columns that support full-text searches can be added or dropped from an inactive registered table

At all of these levels, there are facilities to retrieve information on property, status, and volume (for example, size of the full-text catalog). These facilities take the form of both GUIs and stored procedures, and almost all tasks can be accomplished either way. This document will concentrate on the stored procedures and only give a few examples of the use of GUIs. The term "stored procedures" has been used to illustrate the level at which an administrator is communicating with SQL Server. In fact, a combination of stored procedures and scalar-valued property functions is used.

Stored Procedures Scenario

The following is an outline of the steps that an administrator operating in the "stored procedure mode" could take to set up full-text search on selected tables and columns in the pubs database. This scenario will not demonstrate everything that is available, but only enough to provide insight as to how these procedures can be used.

  1. Check that Microsoft Search Service is running by looking at the shape or color of the icon for full-text search in SQL Server Enterprise Manager.

    If necessary, the service can be started in one of several ways:

    • Through the context menu of the full-text search object in SQL Server Enterprise Manager
    • Outside SQL Server through the Service Control Manager (in the Service Control Manager, this service is called Microsoft Search Service)
    • From a MS-DOS® prompt by typing "net start mssearch"
    • From SQL Server service manager

    The full-text service also can be stopped in all these places.

    In this case, pubs has not been enabled.

  2. Determine if the pubs database has been enabled for full-text processing. The SQL statement SELECT DatabaseProperty('pubs', 'IsFulltextEnabled') returns 1 if the service has been enabled, and 0 if it has not. In this case, the service has been enabled.

  3. Connect to the pubs database by executing the Transact-SQL statement:

    USE pubs
    

  4. Enable pubs for full-text search by invoking the stored procedure:

    sp_fulltext_database 'enable'
    

  5. Create a full-text catalog named PubsCatalog, opting for the default directory. This is done by invoking the stored procedure:

    sp_fulltext_catalog 'PubsCatalog', 'create'
    

    This procedure creates metadata about a full-text catalog in the system table of the database and builds an empty full-text catalog in the file system. The file is created in a default root directory that can be overridden, if desired, by using a third parameter.

  6. Register the authors, jobs, pub_info, and titles tables for full-text processing. Tables so registered must have a column (called the full-text unique key column) that is guaranteed to have a unique value for each row. Because all these tables have a primary key that consists of a single column, they all qualify.

    To register such a table, you must specify the name of the index that enforces the unique value for the unique key column. For a given table, this information can be obtained using the sp_helpindex stored procedure. The indexes of interest in this scenario are as listed in Table 3:

Table 3. Registering tables indexes: index names and values in key columns

Index Name Values in Key Columns
Authors UPKCL_auidind
Jobs PK_jobs_22AA996
pub_info UPKCL_pubinfo
titles UPKCL_titleind

Knowing these names, it is now possible to register the tables by invoking the sp_fulltext_table stored procedure once for each table:

sp_fulltext_table 'authors', 'create', 'PubsCatalog', 'UPKCL_auidind'
sp_fulltext_table 'jobs', 'create', 'PubsCatalog', 'PK_jobs_22AA996'
sp_fulltext_table 'pub_info', 'create', 'PubsCatalog', 'UPKCL_pubinfo'
sp_fulltext_table 'titles', 'create', 'PubsCatalog', 'UPKCL_titleind'

The effect of these invocations is to update the metadata in the system tables both for this full-text catalog and for these tables.

  1. For each of the newly registered tables, specify the names of the columns that are to be registered. This is done by invoking the sp_fulltext_column stored procedure once for each column:

    sp_fulltext_column 'authors', 'address', 'add'
    sp_fulltext_column 'jobs', 'job_desc', 'add'
    sp_fulltext_column 'pub_info', 'pr_info', 'add'
    sp_fulltext_column 'titles', 'type', 'add' sp_fulltext_column 'titles', 'notes', 'add'

    The effect of these invocations is to augment the metadata in the system tables.

    Note: A mistake was made for the sake of later illustration. For the titles table, the type column, rather than the titles column has been registered.

  2. Before a full-text index can be created, it must be active. Activate the ability to create a full-text index for these tables by invoking the sp_fulltext_table stored procedure once for each table:

    sp_fulltext_table 'authors', 'activate'
    sp_fulltext_table 'jobs', 'activate' sp_fulltext_table 'pub_info', 'activate'
    sp_fulltext_table 'titles', 'activate'

    This does not create the full-text indexes; it only registers the tables as active in the metadata of the full-text catalog so that data from these tables will be included in the next population. Additionally, this defines full-text population start seeds for each table to the full-text service.

  3. Start a full population of the PubsCatalog full-text catalog by invoking the sp_fulltext_catalog stored procedure:

    sp_fulltext_catalog 'PubsCatalog', 'start_full'

    The population of a full-text catalog is an asynchronous operation. This means that immediately following the execution the procedure and the return to the procedure's caller, it is unlikely that the full-text indexes will have been created yet.

  4. Inquire into the progress of the population of the PubsCatalog full-text catalog. This statement:

    SELECT FulltextCatalogProperty ('PubsCatalog', 'PopulateStatus')
    

    returns 0 if the service is idle for the full-text catalog and therefore (supposedly) finished, and 1 or more to indicate various stages of population.

  5. Issue some SQL queries to confirm that the administration was executed properly. For example, issue the following SQL statement:

    SELECT P.pub_name, T.title, T.price
    FROM publishers P, titles.T
    WHERE P.pub_id = T.pub_id
    AND P.country = 'England'
    AND CONTAINS (T.notes,
    '"case is altered" OR
    "cat and custard pot" OR
    "the monarch and the sphinx"
    ' )
    

  6. Issue the following SQL query:

    SELECT title_id, title, pubdate
    FROM titles
    WHERE CONTAINS (T.title,
    'classic ~ french ~ cooking')
    

    This results in an error because the title column was not registered for full-text queries.

  7. Check for mistakes by using this statement, which returns 1 if title is part of the full-text index for the books table, and 0 if it is not.

    SELECT ColumnProperty ( ObjectId('titles'),
    'titles',
    'IsFulltextIndexed' )
    In this case, the value returned is 0.
    

  8. List the columns that participate in full-text processing for titles by invoking the stored procedure:

    sp_help_fulltext_columns 'titles'
    

    The results of this query will show that there was a mistake and that type instead of title was included in the full-text index definition.

  9. Deactivate the titles table so that the title column can be added to the full-text index and the type column can be removed. This is done by invoking the stored procedure:

    sp_fulltext_table 'titles', 'deactivate'
    

    In addition to allowing columns to be added and deleted, the effect of deactivating the titles table means that the table no longer participates in the population of the PubsCatalog full-text catalog. However, the metadata remains and the table can be reactivated. The existing full-text index for the titles table remains in place until the next full population of PubsCatalog, but it is unused because SQL Server blocks queries on deactivated tables.

  10. Add the title column and remove the type column from the metadata for the title table's full-text index. This is done by invoking the sp_fulltext_column stored procedure once for each column:

    sp_fulltext_column 'titles', 'type', 'drop'
    sp_fulltext_column 'titles', 'title', 'add'
    

  11. Reactivate the books table by invoking the stored procedure:

    sp_fulltext_table 'titles', 'activate'
    

  12. Start an incremental population of the PubsCatalog full-text catalog by invoking the stored procedure:

    sp_fulltext_catalog 'PubsCatalog',
    'start_incremental'
    

    An incremental population will refresh the full-text catalog by indexing data in the full-text enabled columns that meet any of the following criteria:

    • Data from rows that have been updated or inserted since the last population in tables that have a timestamp column
    • Data from all rows in tables that do not have a timestamp column, tables that were enabled for full-text processing since the last population, or tables whose schemas have been modified in any way since the last population
  13. After the repopulation of PubsCatalog is complete, reissue the query from step 12 of this procedure above. This time, no error is raised.

Graphical User Interfaces

GUIs are simple and straightforward to use. This document concentrates on showing how they fit into the rest of GUI for the SQL Server Enterprise Manager.

Console Pane

Figure 11. Console pane

There are two full-text objects on the console (left) pane:

  • Full-Text Search: Right-clicking on this brings up a menu that has options such as starting and stopping the full-text search service.
  • Full-Text Catalogs: Selecting this causes a list of full-text catalogs for a database to be displayed in the details (right) pane. Right-clicking this brings up a context menu that has options such as creating new full-text catalogs and rebuilding all the full-text catalogs for a database.

Tools Menu

Figure 12. Tools menu

The option of interest here is Full-Text Indexing. Selecting this launches the Full-Text Indexing Wizard.

Context Menu of a Typical Table

Figure 13. Context menu

The Full-Text Index Table menu option has suboptions that can be used to launch the Full-Text Indexing Wizard to define or to modify the full-text index specifications for a table or to remove full-text indexing from the table.

Property Sheet Tabs of a Typical Table

Selecting the Full-Text Indexing tab selects a property page. Figure 14 below shows a typical page:

Figure 14. Property sheets tabs

Catalog Pane and Context Menu

Figure 15. Catalog pane and context menu

This menu was obtained by the following steps:

  1. Select the Full-Text Catalogs object on the Console Pane A to obtain the list of full-text catalogs.
  2. Right-click the fcp.northwind.ft.ctlg to obtain the context menu that is shown in the screen fragment.

Select the Properties item to get the Full-Text Catalog Properties dialog box.

Status Property Page

Figure 16. Status property page

Schedules Property Page

Figure 17. Schedules property page

This property page was obtained by selecting the Schedules option.

Full-Text Indexing Wizard

This wizard may be invoked from many places: The two that we have shown here are by selecting the Full-Text Indexing… option from the Tools Menu B or one of the Full-Text Index Table options from the Context Menu C. The wizard gathers all the information necessary to create and maintain a full-text index for a given table.

Registering and activating a table does not mean that the full-text queries can be issued against the table. A full-text index must first be created, and this is not a function of the wizard. After registering one or more tables, start a full population of the full-text catalogs for those tables. This can be done by selecting the Start Population option on the Pubs Catalog context menu.

Figure 18. Full-text indexing wizard

Conclusion

The full-text search features supported by Microsoft SQL Server 7.0 are entry-level, but still serve many useful full-text searching purposes for users.

It is significant that even at entry level, SQL Server 7.0 provides the ability to include data both inside and outside the database. The SQL Server 7.0 distributed query processor enables advanced capability by joining full-text query results in the database with results generated by queries against Index Server 2.0 in the file system. An excellent foundation has been paved during the integration of the Microsoft information retrieval technologies into SQL Server 7.0, technologies that have already shipped in several Microsoft products and that will serve as the basis for other Microsoft products that support text search.

Additionally, the component reuse strategy in SQL Server 7.0 benefits the customer by providing seamless upgrade capability in the underlying index and search engine, and by providing cross-product consistency.

---------------------------------------------

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

©1998 Microsoft Corporation. All rights reserved.

Microsoft, the BackOffice logo, MS-DOS, Windows, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

Other trademarks and tradenames mentioned herein are the property of their respective owners.

The names of companies, products, people, characters, and/or data mentioned herein are fictitious and are in no way intended to represent any real individual, company, product, or event, unless otherwise noted.

Part number: 098-82314

© Microsoft Corporation. All rights reserved.