Extensions to SQL Server to Support Full-Text Search

Article
01/28/2010

Updated : July 19, 2001

by Margaret Li (Program Manager, Data Access Group) and Frank Pellow (Program Manager, SQL Server Relational Engine)

Abstract

Microsoft® SQL Server™ 7.0 introduces facilities that support textual queries on data residing in SQL Server itself as well as textual queries on data in the file system.

This paper describes the first of these capabilities, that is support of textual queries against data within SQL Server tables. We begin by introducing the Full-Text Search concepts, followed by the form that a full-text search query would take and the type of information that can be retrieved via such queries. We then take the reader through a fairly technical presentation of the internal design and architecture of the Full-Text Search system, and then proceed to describe how this system can be administered via stored procedures and/or graphic user interfaces available through SQL Server Enterprise Manager.

A reading of this paper is intended to give a good overview of the capability of SQL Server 7.0 Full-Text support and a good grasp of how the various sub-components interact to provide this support.

Support of textual queries against data in the file system is the subject of another paper.

Introduction

Today, a very large portion of digitally stored information is still in the form of unstructured data, primarily text. While the bulk of this text data is stored in the file system, some corporations have begun to manage it by storing it in relational databases in character-based columns such as VARCHAR, and TEXT. What this means is that relational database users now need a mechanism to effectively retrieve textual data from the database itself. Traditional RDBMSs, such as Microsoft SQL Server 6.5, were not designed for efficient full-text retrieval. For example, while SQL Server 6.5 has some capabilities of retrieving text based on pattern matching, it cannot handle searches that look up words and phrases in close proximity to one another.

There are two major types of textual queries:

Property Search: Apply filters to documents in order to extract properties such as author, subject, type, word count, printed page count.

Full-Text Search: Create indexes of all nonnoise-words in the documents, then use those indexes to support linguistic searches and proximity searches.

The lack of integration of textual query facilities into relational databases has forced customers to use third party offerings to address these needs. These solutions usually involve pulling data out of the database via bridges or gateways and storing the data as files in the file system so that full-text indexing can be applied. This does not provide a seamless way for a user to combine a full-text query with a regular structured relational query.

Some relational database products now offer customers the ability to specify relational and full-text search conditions seamlessly integrated into the same query. We will show how such a query might be specified. Given that the contents of a set of plain text documents reside in the DocText column of a table named 'doc_collection' and that the table also contains columns for StorName, Size, and DocAuthor, it should be possible to issue the following query:

SELECT Q.StorName, Q.Size, Q.DocAuthor, W.Citizenship 
FROM doc_collection as Q,
     writers as W
WHERE CONTAINS(DocText, ' "SQL Server" NEAR() text') 
  AND Q.DocAuthor = W.writer_name

in order to obtain:

the names, sizes, and authors of all documents,
where the document contains the phrase "SQL Server" in close proximity to "text",
and this is joined with the 'writers' table to obtain the author's citizenship.

Over the last 6 months we have been engaged in a project that will make it possible to issue such queries. The objective is to introduce Microsoft's Text Search technology into SQL Server 7.0 Beta 3 in order to provide users with an entry level full-text queries against plain text data in relational database tables. The syntax is to be a natural extension to the SQL language.

We have leveraged existing technologies within Microsoft to provide full-text search against SQL Server data. Microsoft's Information Retrieval technologies have been around for some time and have already been shipped with Index Server 2.0 and Site Server 3.0. We have combined various components within these technologies and integrated them with SQL Server 7.0 to provide Microsoft's relational database customers with full-text retrieval support Two of the technologies outside of SQL Server that we have been able to integrate are:

The Microsoft Search Service1, a full-text indexing and search service which is referenced by both the terms "Index Engine" and "Search Engine" throughout this paper.
The parser component of the OLE DB Provider for Index Server 2.0 which accepts full-text SQL extensions and maps them into a form that can be processed by the search engine.

This paper has three purposes:

to introduce the extensions to the SQL language in support of full-text search on database data,
to show how the various technologies have been brought together to enable full-text search on database data,
to introduce SQL Server's new administrative facilities in support full-text search on database data.

Full-Text Search Concepts

There are 4 major aspects to supporting full-text retrieval of plain text data for a given database:

Managing the definition of the tables and columns that are registered for full-text search.
Indexing the data in the registered columns. The indexing process scans through the character streams, determines the word boundaries (this is known as word breaking), removes all noise-words2, then populates a full-text index with the remaining words.
Issuing queries against the registered columns for populated full-text indexes.
Ensuring that subsequent changes to the data in registered columns get propagated to the index engine in order to keep the full-text indexes in sync.

The principle underlying design requirement for the indexing, querying, and synching processes above is the presence of a full-text unique key column (or single column primary key) on all tables that are registered for full-text search. The full-text index contains an entry for the non noise-words in each row together the value of the key column for each row.

When processing a full-text query, the search engine returns the key values of the rows that match the search criteria to SQL Server. A very simple example best illustrates what's happening. Suppose we have a SciFi table with the following columns where Book_No column is the primary key column:

Book_No	Writer	Title
A025	Asimov	Foundation's Edge
A027	Asimov	Foundation and Empire
C011	Clarke	Childhood's End
V109	Verne	Mysterious Island

And, suppose we have a full-text retrieval query where we want to find the book titles having the word "Foundation" in them. When the SQL Server Relational Engine encounters a full-text retrieval predicate, it calls the Full-Text Search component to retrieve the values of Book_No that satisfy the text retrieval filter condition. In this case, the values of A025 and A027 are returned. The Relational Engine then uses this information together with other information under its direct control in order to respond to the query.

Unlike "classic" relational database indexes, traditionally full-text indexes are not instantly modified when values in full-text registered columns are updated, when rows are added to full-text registered tables, or when rows are deleted from full-text registered tables. Rather, full-text indexes are usually re-populated asynchronously. There are two reasons for this:

It typically takes significantly more time to update a full-text index than a classic index.
Full-Text searches are usually fuzzy by nature so do not need to be as precise as classic searches.

During the re-population process, the unique key column values are passed to the index engine to identify those items that need to be re-indexed. For example, if the Title associated with V109 gets changed to 'Mystery Island', then the index should be modified in order to reflect this new value.

The full-text administration process starts with designating a table and its columns of interest for full-text search. Either graphical user interfaces (GUIs) and wizards or built-in stored procedures are first used to register tables and columns as eligible for full-text search. Then, a separate request (again via a GUI or stored procedure) is issued to populate the full-text indexes.

The end result is that the underlying index engine gets invoked and asynchronous index population begins. Full-Text indexing consists of keeping track of which significant words are used and where they are located. For example, a full-text index might indicate that the word 'Microsoft' is found at word number 423 and word number 982 in the 'Abstract' column of the 'DevTools' table for the row associated with a 'ProductID' of 6. This index structure supports an efficient search for all items containing indexed words, and advanced search operations such as phrase searches and proximity searches. An example of a phrase search is looking for white elephant - a search for places where white is followed by elephant. An example of a proximity search is looking for places where big occurs near house. To prevent the full-text index from becoming bloated with words that do not help the search, noise-words, such as a, and, the are ignored. Noise-word lists for many languages are provided and the set of supported languages is growing. The choice of a particular noise-word list is based on the language setting of the database server. These noise-word lists should be sufficient for most normal operations, but can be modified for specific environments. The noise-word lists are available on a published path and administrators can use a regular text editor to modify the contents of any of these lists. For example, high-tech computer companies might want to add the word 'computer' to their noise-word list.

Extensions to Microsoft's Transact-SQL needed to be made so that users would be able to pose full-text queries. Here we were able to utilize the syntax of the CONTAINS and FREETEXT predicates already supported for full-text search in both the Microsoft OLE DB Provider for Index Server 2.0 and the Microsoft OLE DB Provider for Site Server 3.0 Search.

The CONTAINS predicate is used to search for:

a word or a phrase
the prefix of a word or phrase
a word or phrase that is near another
a word that is an inflectional form of another (e.g., drive is the inflectional stem of drives, drove, driving, driven)
a set of word or phrases, each of which is assigned a different weighting

The FREETEXT predicate is a simple form of natural language query. Any text, including words, phrases, or sentences, can be specified in the query. The search engine matches values that reflect the meaning, rather than the exact wording, of the query.

The SQL Server Relational Engine recognizes CONTAINS and FREETEXT and performs some minimal syntax and semantic checking, such as ensuring that the column referenced in the predicate has been registered for full-text search. During query execution, a full-text predicate and other relevant information are passed to the Full-Text Query component. After further syntax and semantic validation, the search engine is invoked and it returns the set of unique key values identifying those rows in the table that satisfy the full-text search condition.

There are currently two mechanisms available to an administrator for keeping a full-text index synchronized with changes to the data in its table:

It is possible to request a re-population at any time.
Re-population can be set up on a regular schedule.

Both GUIs and stored procedures are provided for these tasks. If a table has a column with a TIMESTAMP data type, it is possible to request an incremental re-population where only changes to the table since the last [re-]population participate in the re-population. If a table does not have a column of type TIMESTAMP, then only a complete re-population is possible.

The alternative of automatic update of full-text indexes based on log activity is a candidate in a future release.

Transact-SQL Extensions for Full-Text Queries

First of all, a point worth repeating is that these extensions are consistent with the SQL supported for full-text search in Index Server 2.0. Furthermore, the SQL support for full-text search follow the ISO SQL-3 functional methodology for full-text syntax extensions.

The primary extension to Transact-SQL consists of the new CONTAINS and FREETEXT predicates. These predicates are used to find column values that match special full-text query criteria.

In addition, the new ContainsTable() and FreetextTable() rowset-valued functions are supported. These functions can be specified (in the FROM clause) to return a two-column table where one column uniquely identifies the rows that match the specified full-text query criteria and the other contains rank values showing the degree to which the rows match the criteria.

We turn first to the new predicates. To be consistent with similar features in other products and to make these predicates more extensible, we have chosen to use functional notation. The high level syntax is:

 ·--CONTAINS--(column_ref[s]--,ft_search_condition)--·
   +FREETEXT+

In SQL Server 7.0, the language is always implied by the locale of the database holding the data. The flexibility of the functional style of these two predicates allows easy, upward compatible future extension for a third parameter to designate the language to be used for the query.

These predicates can be used with the "character family" of data types. By "character family", we mean CHAR, VARCHAR, TEXT, NCHAR, NVARCHAR, and NTEXT. Currently, document formats often stored in column defined with the "binary family" of data types cannot be indexed or queried. Such support will be considered after SQL Server 7.0.

CONTAINS predicate

This predicate is used to determine whether or not values in full-text registered columns contain certain words and phrases. Currently, these predicates can only reference base tables.

The syntax is:

   ·--CONTAINS--(---column_ref------------------------> 
                  +-*----------+ 
   >--,--'--contains_condition--'--)-----------------·
 contains_condition :
·--------simple_term------------------------------->
      |  +--prefix_term------|     |
      |  +--proximity_term---|     |
      |  +--generation_term--|     |
      |  +--weighted_term----+     |
|                            |
      +--(--contains_condition--)--+
     +----------------------------------------------+
     v                                              |
   >--------------------------------------------------·
      +----OR--------------simple_term-------------+
         +-AND-----|  | +--prefix_term------|    |
         +-AND NOT-+  | +--proximity_term---|    |
                      | +--generation_term--|    |
                      | +--weighted_term----+    |
                      |                          |
                      +-(-- contains_condition-)-+

where:

column ref or *	Identifies the column or columns to be searched.
*column ref*	Refers to a specific column that is full-text registered.
*	Refers to all columns in the table that are full-text registered.

simple_term, prefix_term, proximity_term, generation_term, and weighted_term are explained below.

AND, OR, and NOT are supported as one would expect (see below).

simple_term :

This is used in order to match the exact word or phrase being searched for.

The syntax is:

   ·----word-------------------------------·
      |               |
      +-"--phrase--"--+

where:

word Refers to one or more characters without spaces or punctuation.

Phrase Refers to multiple words with spaces between.

In keeping with the norm for full-text products, the search for characters in the word or phrase is always case-insensitive.

Some examples of simple_term used in the context of the CONTAINS predicate in a WHERE clause follow:

WHERE CONTAINS( context_col, 'hockey' )
WHERE CONTAINS( context_col, ' "ice hockey" ')

Suppose we have one row with a context_col value of "This is a dissertation on the use of frozen meadow muffins as hockey pucks" and another row with the value "Dissertation on new ways of splitting the atom". Since "this", "is", "a", etc. are all noise-words, they are not stored in the full-text index. Therefore, a query with the CONTAINS predicate:

   CONTAINS (context_col, '"this is a dissertation"')

is the same as:

   CONTAINS (b_column, 'dissertation')

and both rows will be returned as hits. This is because in the first query, the noise-words "this", "is", and "a" are thrown away before processing the query.

Combining terms

As with other SQL search conditions, more complex conditions can be specified by linking individual operands with Boolean operators. In this case, the operands are any of the types of terms being discussed. Except for the restrictions that the OR NOT combination is not supported and that NOT cannot be specified before the first term, the rules are exactly the same as those used to combine individual predicates to form search conditions. For instance, parentheses my be used to change the default priority order in which the operators are applied.

Some examples of simple_term being combined within a CONTAINS predicate in a WHERE clause follow:

WHERE CONTAINS( context_col, 'hockey OR curling' )
WHERE CONTAINS( context_col, 'hockey AND NOT field')
WHERE CONTAINS( context_col, 
                ' ("ice hockey" OR curling) AND NOT 
                  Canada ' )

prefix_term :

This is used in order to match words or phrases that begin with the specified text.

The syntax is:

   ·----"--word--*--"------------------------·
      |                  |
      +-"--phrase--*--"--+

A prefix_term consists of a simple_term appended with an asterisk in order to turn on prefix matching on the word or phrase. All text that starts with the material before the * is matched. The * wildcard symbol in this case is somewhat like the % symbol in the LIKE predicate in that it matches zero, one or more characters (of the root word(s) in the word or phrase). In the case of a phrase, each word within the phrase is considered to be a prefix, for example, the term "local bus *" matches "locality busy", "local bush" and "locale bust".

Some examples of prefix_term used in the context of the CONTAINS predicate in a WHERE clause follow:

WHERE CONTAINS( context_col, ' "atom*" ' )

This matches context_col values that contain the word 'atom', 'atomic', 'atomism', 'atomy', etc.

WHERE CONTAINS( abstract, ' "wine*" OR "vine*" ')

This matches abstract values that contain the word 'wine' or 'vine' or alternatively, a word such as 'winery', 'wines', 'vineyard', or 'vinegar'.

proximity_term :

This is used when the words or phrases being searched for are required to be in close proximity to one another.

The syntax is:

                       +----------------------------+                                               
                       v                            |
·---simple_term_1------NEAR()----simple_term_n-----·
     |               |  + ~ -----+|               |
     +-prefix_term_1-+            +-prefix_term_n-+

The proximity_term is similar to the AND operator in that more than one word or phrase must exist in the column being searched. It differs from AND because the relevance of the match increases as the words appear "closer" together.

The syntax is designed to be extensible for possible future support for specification of units of proximity such as words, sentences, paragraphs, chapters, etc.

NEAR() and ~ mean the same thing: the word or the phrase on the left side of the operator is "close" to the other word or phrase. What does "close" mean? The answer is purposefully vague. It can be thought of as being approximately within 50 words of one another but the algorithm is more complicated than that. While words within the same sentence are 1 word distance from each other, larger gaps are assigned between units such as sentences, paragraphs, and chapters. Even if words or phrases are very far apart, the query is still considered to be satisfied; the row just has a very low (zero) rank value. However, if the contains_condition consists only of NEAR() proximity terms, then SQL Server does not return rows with a rank value of zero.

It is possible to chain the proximity matching; for example:

   a ~ b ~ c

means a should be near b, which should be near c.

Because of the fuzzy nature of full-text searches, it is often desirable to see the rank values mentioned above. The ContainsTable() rowset-valued function (described later) can be used to execute queries that return a rank value for each row.

Some examples of proximity_term used in the context of the CONTAINS predicate in a WHERE clause follow:

WHERE CONTAINS( context_col, ' hockey ~ player ' )

This matches context_col values that contain the word 'hockey' in close proximity to the word 'player'.

WHERE CONTAINS( context_col, ' hockey ~ "play*" ')

This matches context_col values that contain the word 'hockey' in close proximity to a word that starts with 'play'.

WHERE CONTAINS( context_col, '  "great*" 
                              ~ "Maurice Richard" ')

This matches context_col values that contain a word starting with great' in close proximity to the phrase "Maurice Richard".

generation_term :

Linguistic generation is used when the words being searched for need to be expanded to include the variants of the original word.

The syntax is:

                               +--------------+                                               
                               v              |  
·--FORMSOF--(--INFLECTIONAL---,simple_term---)--·

INFLECTIONAL means words that are plural and singular forms of nouns or the various tenses of verbs will be matched. A single term will not match both exclusive noun and exclusive verb forms. Again, the syntax is designed to be extensible enough to handle other linguistically-generated forms, such as derivational, soundex, and thesaurus.

An example of a generation_term used in the context of the CONTAINS predicate in a WHERE clause follows:

WHERE CONTAINS(curriculum_vite, 
               ' FORMSOF (INFLECTIONAL, skate) ' )

This matches curriculum_vite values that contain words such as 'skate', 'skates', 'skated', and 'skating'.

weighted_term:

This is used for queries that return rows that match a list of words and phrases, each optionally given its own weighting. Matching values only have to match a single element in the list.

The syntax is:

   ·--ISABOUT----------------------------------------->
        +--,--------------------------------------+                                               
        v                                         |  
   >--(----simple_term------------------------------)-·
         +-prefix_term-----|  +-WEIGHT-(-n.nnn-)-+
         +-proximity_term--|
         +-generation_term-+

where:

n.nnn : represents a decimal constant between 0 and 1.

A row is returned if there is a match on any one of the ISABOUT elements.

Each component in the vector can optionally be weighted. The weight assigned forces a different measurement of the rank value that is assigned to each row that matches the query.

Some examples of weighted_term used in the context of the CONTAINS predicate in a WHERE clause follow:

WHERE CONTAINS( article, 
                ' ISABOUT(hockey, puck, goalie) ' )

This matches article values that contain any of the words 'hockey', 'puck', or 'goalie' and the "better" matches will contain more than one of the words.

WHERE CONTAINS( article, 
         'ISABOUT("Toronto Maple Leafs" WEIGHT(1.0),
                  "Maple Leafs" WEIGHT(.5),
                  Leafs WEIGHT(.2) ) 
         ' )

This matches article values that may have information about the Toronto Maple Leafs hockey team with higher rank values being assigned to articles if they have more words from the phrase.

FREETEXT predicate

FREETEXT is used to determine whether or not values in full-text registered columns reflect the meaning, rather than the exact words, specified in the predicate.

The syntax is:

   ·--FREETEXT--(---column_ref------------------------> 
                  +-*----------+ 
   >--,--'--freetext_string--'--)---------------------·

This is an extremely simple form of natural language query where the index engine internally "word-breaks" the freetext_string into a number of search terms, generates the stemmed form of the words, assigns each term some heuristic weighting, then finds the matches.

An example of a FREETEXT predicate used in a WHERE clause follows:

WHERE FREETEXT(articles, ' Who have been the most  
    valuable players for the Montreal Canadien? ' )

Combining predicates and use of predicates:

Because CONTAINS and FREETEXT are SQL predicates they can be used anywhere that SQL predicates are supported, that is in any search condition. In particular, they can be combined with each other and with other predicates such as equality, LIKE, and BETWEEN to specify extensive search conditions.

The WHERE clause in the following query uses both a CONTAINS predicate and an comparison predicate. It obtains the title and publication year of all the books in the 'titles' table in the 'pubs' database where the book costs less than $20.00 and text in the 'notes' column indicates that the book is about ice hockey.

SELECT title, DatePart(year, pubdate) 
  FROM pubs
  WHERE price < 20.00
    AND CONTAINS (notes, ' "ice hockey" ')

The following query uses CONTAINS within a subquery. It obtains the title of all the books in the 'titles' table for the publisher who is located close to the flying saucer in Moonbeam, Ontario. This information about the publisher is known to exist in the 'pr_info' column in the 'pub_info' table and it is also known that there is only one such publisher.

SELECT T.title, P.pub_name 
  FROM publishers P,
       Titles T
  WHERE P.pub_id = T.pub_id
    AND P.pub_id = (SELECT pub_id
                      FROM pub_info
                      WHERE CONTAINS 
                              (pr_info, 
                               ' moonbeam AND
                                 ontario AND
                                 "flying saucer" 
                               ') )

ContainsTable() rowset-valued function:

ContainsTable() is used to issue "contains" type full-text queries that return relevance ranking for each row.

The syntax is:

  ·--CONTAINSTABLE--(--table_ref---,---column_ref------> 
                                   +-*----------+ 
   >--,--'--contains_condition--'--)-----------------·

where:

table_ref Refers to a full-text registered table

column_ref Refers to a specific full-text registered column

* Refers to all columns in table_ref that are full-text registered.

contains_condition Is the same as described for the CONTAINS predicate

Although both the CONTAINS predicate and ContainsTable() function are used for "contains" type full-text queries and the SQL used to specify the full-text search condition is the same in both, there are major differences in the way that these are used:

CONTAINS() returns true/false value, so is typically specified in the WHERE clause of a SELECT statement.

ContainTable() returns a table of zero, one, or more rows, so must always be specified in the FROM clause.
CONTAINS() can only be used to specify selection criteria which SQL Server uses to determine the membership of the result set.

ContainsTable() is also used to specify selection criteria. The table returned has a column named KEY that contains "full-text key" values. Each full-text registered table has a column whose values are guaranteed to be unique and the values returned in the KEY column are the "full-text key" values of the rows that match the selection criteria specified in the contains_condition. Furthermore, the table produced by ContainsTable() has a column named RANK which contains values between 0 and 1000 which can be used to rank the rows returned according to how well they met the selection criteria.

Queries that utilize the ContainsTable() function are more complex than those using the CONTAINS predicate because it is necessary to explicitly join qualifying rows returned by ContainsTable() with the rows in table_ref. We will expand the example from the Introduction to show how such queries can be written. In this example, the contents of some documents reside in the DocText column of a table named 'doc_collection' and the table also contains columns for StorName, Size, and DocAuthor. The unique key column for the table is named DocNo. We want the rows in the result set to be ordered so that those with the highest RANK value are returned first. The expansions to the earlier example are shown in bold.

SELECT Q.StorName, Q.Size, Q.DocAuthor, W.Citizenship 
FROM doc_collection as Q,
     writers as W,
ContainsTable(doc_collection, DocText,
                   ' "SQL Server" NEAR() text'
                  ) AS K
WHERE Q.DocAuthor = W.writer_name
AND K.[KEY] = Q.DocNo
ORDER BY K.RANK DESC

To simplify the use of ContainsTable() for queries that involve only one table and no grouping, we suggest the use of a fixed template such as:

The example that follows uses the above template. In this example, the wording on the plaques in the Hockey Hall of Fame reside in the PlaqueWording column of a table named HockeyHall and the table also contains columns for PlayerName, StartYear, and LastYear. The unique key column for the table is named PlaqueNo. We want to return the PlayerName, PlaqueNo, and RANK value for players who might have played for the Kenora teams in early 1900s. The higher ranking rows should be returned first.

FreetextTable() rowset-valued function:

Used to issue "freetext" type full-text queries that return relevance ranking for each row.

The syntax is:

   ·--FREETEXTTABLE--(-table_ref---,---column_ref-----> 
                                     +-*----------+ 
   >--,--'--freetext_string--'--)---------------------·

This function is used in the same manner as a ContainsTable() function and the search condition is the same as specified by a FREETEXT predicate.

Textual Queries Against Data in the File System :

This is the subject of another white paper.

It is important to understand that both facilities exist, so we will introduce the topic here.

It is also worth noting that the ability to search just against file system data already exists today in Index Server 2.0. This capability is further enhanced by the SQL Server 7.0 Distributed Query Processor which, in conjunction with the Microsoft OLE DB Provider for Index Server 2.0, can be used to issue both property and full-text searches against data in the file system and to join the results with data in the database. Detailed discussion of this topic is the subject of another white paper.

For now, here in this paper, We will show how such queries are specified by recasting the first example in this paper in order to obtain some of the data from the file system. The following query:

selects the names, sizes, and authors of all Microsoft Word files on the D drive
where the document contains the phrase "SQL Server" in close proximity to "text"

and joins this with the 'writers' table to obtain the author's citizenship

SELECT Q.FileName, Q.Size, Q.DocAuthor, W.Citizenship 
FROM OpenQuery(MyLinkedServer, 
               'SELECT FileName, Size, DocAuthor
                from SCOPE('' "D:\" '')
                WHERE CONTAINS(''"SQL Server"
                                 NEAR() text'')
                AND FileName LIKE ''%.doc%'' '
              ) as Q,
     writers as W
WHERE Q.DocAuthor = W.writer_name

From this query, it is obvious that the location of the data must be known to people writing the queries. The ISO SQL standards organization is working on defining a new DATALINK data type to make possible seamless queries that access data both inside and outside the database. This emerging definition offers some interesting possibilities.

Full-Text Indexing Component Architecture:

There are two aspects to the Component Architecture: the Query Components and the Indexing Components. The Query Components are discussed in the section that follows this one.

The Indexing Components manage the initial population of the full-text indexes and the subsequent updating of these indexes.

Before presenting the steps involved in the indexing process, we will introduce the various sub-components. The flow among these sub-components will then be described.

SQL Server Enterprise Manager User Interface: This is the GUI for the Full-Text Indexing administration utility. It takes the form of extensions to the current SQL Server Enterprise Manager property sheets, plus a new wizard. These have been designed to assist first-time and infrequent administrators of full-text indexed tables. Having said that, the GUI is still appropriate for frequent users. The GUI allows a user to select the desired tables for full-text indexing and walks him/her through the various steps needed to set this up.

One of the options is to schedule regular refreshing of full-text indexes; this utilizes the scheduling facilities that already exist in SQL Server. The GUI can also be used to display properties of the full-text indexed tables. This User Interface utilizes a set of new full-text administration system stored procedures, a set of full-text related properties available via SQL Server's "property" functions as well as the already existing SQL Server Agent Job Scheduler.

Full-Text Administration System Stored Procedures: This set of system stored procedures is used to:

set up full-text catalogs (a full-text catalog is simply an entity holding a collection of full-text indexes).
register a table and selected columns within it for full-text search.
request the population of the indexes within a full-text catalog.
undo any of the above.

The actual setup and configuration capability of full-text indexing resides in these stored procedures, not the GUI. It is possible to use these stored procedures directly rather than going through SQL Server Enterprise Manager and experienced administrators may choose to do this. Calls to the stored procedures may also be embedded in scripts and in other stored procedures.

Scheduling Stored Procedures: The set of stored procedures (such as sp_add_jobschedule) used for scheduling "regular" jobs can also be used to schedule refreshes of full-text catalogs.

Full-Text Indexing Component Architecture

Full-Text Index Additions to System Tables: The 'sysdatabases' system catalog has been extended to provide the new IsFulltextEnabled property which is available via the DatabaseProperty() function.

A new system table 'sysfulltextcatalogs' is now present for each database. This table holds metadata about the full-text catalogs that are linked to this database. Note that a given full-text catalog is linked to one and only one database.

The 'sysobjects', 'sysindexes', and 'syscolumns' tables that reside in each database will be augmented with information about the tables and the associated columns in that database that have been full-text indexed. The information is:

the 'ftcatid' column of user table rows in 'sysobjects', which points to a row in 'sysfulltextcatalogs'.
the TableHasActiveFulltextIndex property, which is available via the ObjectProperty() function.
the TableFulltextIKeyColumn property, which is the column ID of the full-text unique key column and is available via the ObjectProperty() function.
the TableFulltextCatalogId property, which is the full-text catalog ID associated with the table and is available via the ObjectProperty() function.
the IsFulltextKey property, which is available via the IndexProperty() function.
the IsFulltextIndexed property, which is available via the ColumnProperty() function.

Microsoft Search Service 3 : This is a Microsoft Windows NT® service which has two roles:

Indexing Support takes care of accepting a request to populate the full-text index of a given table.
Querying Support processes full-text search queries. This is described later.

Microsoft Search Service operates under the context of the local system account. This service must be run on the same machine as SQL Server.

Indexing Support: The information about populating full-text indexes is submitted in the form of a population start seed value which uniquely identifies both the database and the table that need to be full-text indexed. When the service is ready to handle the population, it invokes the SQL Server Handler.

SQL Server Handler: This sub-component is a "driver" that plugs into Indexing Support and is specially coded to understand how to deal with SQL Server data stores.

The SQL Server Handler always runs in the same process as Microsoft Search Service.

Index Engine: This sub-component is presented with indexable units of text (in the form of character strings), each with an identifying key. It scans through character strings, determines the word boundaries, removes all noise-words, then populates a full-text index with the remaining words.

Querying Support: This is used to process full-text queries so is not described until later.

Full-Text Catalogs: This is the location where full-text indexes reside. This is a Windows NT file system directory that is accessible only by Windows NT Administrator and Microsoft Search Service. The full-text indexes are organized into full-text catalogs, which are referenced by friendly names. Typically, the full-text index data for an entire database is placed into a single full-text catalog. However, administrators do have the flexibility to partition the full-text index data for a database across more than one full-text catalog. This is particularly useful if one or more of the tables being full-text indexed contain a large number of rows. A full-text catalog is limited to holding index data for 231 – 64K rows.

In SQL Server 7.0, there are no facilities to coordinate backup and recovery relational database data in conjunction with the relevant indexes in the full-text catalogs (however, there are facilities to get them back into sync). Coordinated backup and recovery is a candidate in a future release.

Full-Text Indexing Administration Steps:

All the steps below are "kicked off" by an administrator using GUIs or stored procedures (either directly or on a scheduled basis).

#0:	The preliminary steps are first to enable the database for full-text search and then to identify the tables and columns that are to be registered for full-text search. This information is stored in SQL Server's system tables.
#1:	When a table is activated for full-text processing, a population start seed for that table is sent to Indexing Support.
#2:	The next step is to request the initial population of the full-text index for a table. Actually, the granularity of population is a full-text catalog, so if more than one table has been linked to a full-text catalog, the result of this request is the full population of the full-text indexes for all tables linked to that catalog.
#3:	The knowledge of the tables and rows that require indexing resides in SQL Server, so when Indexing Support receives a population request for a full-text catalog, it calls back into SQL Server in order to obtain data from all the columns in the table that have been marked for indexing. When this data arrives, it is passed to the Index Engine where it broken into words, noise-words are removed, and the remaining words are stored in the index.
#4:	How are full-text indexes kept up to date? A GUI is provided that sets up a schedule for periodic refreshes of a full-text catalog. This GUI utilizes stored procedures from the SQL Server Job Scheduler. It is also possible to request a refresh at any time – either via a GUI or by direct use of a stored procedure.
	If a table has a row-versioning (timestamp) column, re-population can be handled more efficiently. At the time the population start seed for a table is constructed, the largest row-versioning value in the database at that point is remembered. When an incremental population is requested, the SQL Server Handler connects to the database and requests only rows where the row-versioning value is greater than the remembered value.
	During an incremental population, if there is a full-text indexed table in a catalog that does not have a row-versioning column, then that table will be completely re-populated.
	There are two cases where a complete re-population is performed even though there is a row-versioning column on the table. These are: when the schema of the table has changed. when tables have been activated since the last population.
	Following population of tables with a row-versioning column, the "remembered" row-versioning value is updated.
	Since incremental re-population on relatively static tables with timestamp columns can be completed faster than a complete re-population, users can schedule re-population on a more frequent basis. For a given database, users should take care not to mix index data from tables with timestamp columns with tables that do not in the same full-text catalog, as these two groups of tables should usually be on separate re-population schedules.
#n:	Additional tables and columns will be registered for full-text search and full-text indexes will be generated for them. Possibly, the full-text search capability will be removed from some tables and columns. Possibly, some full-text catalogs will be dropped.

Full-Text Query Component Architecture:

The Query Components accept a full-text predicate or rowset-valued function from SQL Server, transforms parts of the predicate into an internal format and sends it to Microsoft Search Service, which returns the matches in a rowset. The rowset is then sent back to SQL Server. SQL Server uses this information to create the result set which is then returned to the submitter of the query.

The SQL Relational Engine accepts CONTAINS and FREETEXT predicates as well as the ContainsTable() and FreetextTable() rowset-valued functions. During parse time, this code checks for conditions such as attempting to query a column that has not been registered for full-text search. If valid, then at run time the ft_search_condition and context information is sent to the Full-Text Provider. Eventually the Full-Text Provider returns a rowset to SQL Server. This rowset is used in any joins either specified or implied in the original query.

The Full-Text Provider parses and validates ft_search_condition, constructs the appropriate internal representation of the full-text search condition, then passes it to the Search Engine. The result is returned to the Relational Engine via a rowset of rows that satisfy ft_search_condition. The handling of this rowset is conceptually similar to the code used in support of the OpenRowset() and OpenQuery() rowset-valued functions.

Microsoft Search Service: This service was described under the Full-Text Indexing Component Architecture above.

Search Engine: This is the sub-component that processes full-text search queries. It determines which entries in the index meet the selection criteria. For each entry that meet the selection criteria, it returns the value of the unique key column and a ranking value.

Indexing Support: This was described under the Full-Text Indexing Component Architecture above.

Full-Text Query Component Architecture

The Full-Text Query Process:

#1:	A query that uses one of the full-text constructs (that is, CONTAINS, FREETEXT, ContainsTable(), or FreetextTable) is submitted to the SQL Relational Engine.
#2:	Queries containing either the CONTAINS or FREETEXT predicate are rewritten so that the rowset returned from the Full-Text Provider will later be automatically joined to the table that the predicate is acting upon. This rewrite is the mechanism used to ensure that these predicates are a seamless extension to SQL Server. People specifying CONTAINS and FREETEXT do not have to concern themselves with or even know about the details of the underlying invocation of Microsoft Search Service.
#3:	The Full-Text Provider is invoked, passing it information such as: the ft_search_condition the friendly name of the full-text catalog where the table's full test index resides the locale id to be used for language (e.g., word breaking) the identities of the database, table, and column
	If the query is comprised of more than one full-text construct, the Full-Text Provider is invoked separately for each such construct.
	Notice that the SQL Relational Engine itself does not look into the contents of the ft_search_condition. This is passed along to the Full-Text Provider which checks it for validity, then creates the appropriate internal representation of the full-text search condition.
#4:	The command is passed to Querying Support.
#5:	Querying Support returns a rowset that contains the unique key column values for the rows that match the full-text search criteria. A RANK value is also returned for each row.
#6:	The rowset is passed along to the SQL Relational Engine. If processing either a ContainsTable() or FreetextTable() function, RANK values are returned; otherwise the RANK value is filtered out.
#7:	The rowset values are "plugged into" the query together with values obtained from the relational database itself and the result set is returned to the user.

Administration of Full-Text Indexes

In the current implementation, full-text indexes are quite different than "classic" SQL indexes. These differences give rise to the need for the administrative tasks that are outlined in this section. Following is a summary of the main differences:

Classic SQL Indexes	Full-Text Indexes
The indexes are stored in and under the control of the database.	Full-indexes are stored in the file system, but are administered via the database.
There can be several indexes per table.	There is only one full-text index definition per table.
Indexes are automatically updated when the data upon which they are based is inserted, updated, or deleted.	The population of full-text indexes must be requested (either via a schedule or a specific request).
Indexes are not grouped.	A group (one or more) of full-text indexes within the same database are gathered together into a full-text catalog.
Indexes are created and dropped using SQL statements.	Full-Text indexes are created, managed and dropped using stored procedures (either directly of via GUIs).

Full-Text Administration is carried out at several different levels:

Server: Certain server-wide properties can be set - such as the required properties for the Microsoft Search Service to initialize and prepare for index population.
Database: A database must be enabled to use full-text search. Metadata for one or more full-text catalogs can be created and dropped in an enabled database.
Full-Text Catalog: A full-text catalog must be populated using administrative facilities.
Table: A table must be registered as supporting full-text queries at which time metadata is created for that table's full-text index. A registered table must be activated before it can participate in the next full-text catalog population.
Column: Columns that support full text queries can be added or dropped from an inactive registered table.

And, at all these levels, there are facilities to retrieve property, status, and volume (e.g., size of the full-text catalog) information. As we outlined earlier, these facilities take the form of both GUIs and stored procedures and almost all tasks can be accomplished either way. Because of the technical orientation of this paper, we will concentrate on the stored procedures and only give a few examples of the use of GUIs. We have been using the term "stored procedures" as illustrative of the level at which an administrator is communicating with SQL Server. In fact, one uses a combination of stored procedures and scalar-valued property functions.

Following is a scenario of the steps an administrator, operating in the "stored procedure mode", could take in order to set up full-text search on selected tables and columns in the 'pubs' database. For each step, the [light] type shows the level at which the administration is being carried out. We will not show everything that is available, but only enough to provide insight as to how these procedures can be used.

[server] Check that the Microsoft Search Service is running. This can be checked by looking at the shape/color of the icon for Full-Text Search in SQL Server Enterprise Manager.

If necessary, the service can be started in one of several ways:
- via the context menu off the Full-Text Search object in SQL Server Enterprise Manager,
- utside SQL Server via the Service Control Manager (note that in the Service Control Manager, this service is named "Microsoft Search Service"),
- from an MS-DOS® prompt by typing "net start mssearch".
- From the SQL Server Service Manager.
The full-text service can also be stopped in all the places listed above.

In this case, 'pubs' has not been so enabled.
[server] Find out if the pubs database has been enabled for full-text processing. The SQL statement:

SELECT DatabaseProperty('pubs',

'IsFulltextEnabled' )

returns 1 if the service has been enabled and 0 if it has not.

In this case, the service has been started.
[server] Connect to the pubs database by executing the SQL statement:

USE pubs
[database] Enable pubs for full-text search by invoking the stored procedure:

sp_fulltext_database 'enable'
[ft catalog] Create a full-text catalog named PubsCatalog, opting for the default directory. This is done by invoking the stored procedure:

sp_fulltext_catalog 'PubsCatalog', 'create'

This procedure creates metadata about a full-text catalog in the database's system tables and builds an empty full-text catalog in the file system. The file is created in a default root directory which can be overridden, if desired, by use of a third parameter.
[table] Register the 'authors', 'jobs', 'pub_info' and 'titles' tables for full-text processing. Tables so registered must have a column (called the full-text unique key column ) whose value is guaranteed to have a unique value for each row. Since all these tables have a primary key that consists of a single column, they all qualify.

To register such a table, it is necessary to specify the name of the index that enforces the unique value for the unique key column. For a given table, this information can be obtained using the sp_help_index stored procedure. The indexes that we are interested in are:

authors

UPKCL_auidind

jobs

PK_jobs_22AA996

pub_info

UPKCL_pubinfo

titles

UPKCL_titleind

Knowing these names, it is now possible to register the tables by invoking the sp_fulltext_table stored procedure once for each table:

sp_fulltext_table 'authors', 'create',

'PubsCatalog', 'UPKCL_auidind'

sp_fulltext_table 'jobs', 'create',

'PubsCatalog', 'PK_jobs_22AA996'

sp_fulltext_table 'pub_info', 'create',

'PubsCatalog', 'UPKCL_pubinfo'

sp_fulltext_table 'titles', 'create',

'PubsCatalog', 'UPKCL_titleind'

The effect of these procedure invocations is to update the metadata in the systems tables both for this full-text catalog and for these tables.
[column] For each of the newly registered tables, specify the names of the columns that are to be registered. This is done by invoking the sp_fulltext_column stored procedure once for each column:

sp_fulltext_column 'authors', 'address', 'add'

sp_fulltext_column 'jobs', 'job_desc', 'add'

sp_fulltext_column 'pub_info', 'pr_info', 'add'

sp_fulltext_column 'titles', 'type', 'add'

sp_fulltext_column 'titles', 'notes', 'add'

The effect of these procedure invocations is to augment the metadata in the system tables.

By the way, a mistake was made (on purpose, for the sake of illustration) and, for the 'titles' table, the 'type' column, rather than the 'titles' column has been registered.
[table] Before an actual full-text index can be created, it must be in the active state. Activate the ability to create a full-text index for these tables by invoking the sp_fulltext_table stored procedure once for each table:

sp_fulltext_table 'authors', 'activate'

sp_fulltext_table 'jobs', 'activate'

sp_fulltext_table 'pub_info', 'activate'

sp_fulltext_table 'titles', 'activate'

Note that this does not actually create the full-text indexes, it simply registers the tables as active in the full-text catalog's metadata so that data from these tables will be included in the next population.4
[ft catalog] Start a full population of the PubsCatalog full-text catalog by invoking the sp_fulltext_catalog stored procedure:

sp_fulltext_catalog 'PubsCatalog', 'start_full'

Note that the population of a full-text catalog is an asynchronous operation. This means that, immediately following the execution of the procedure and the return to the procedure's caller, it is very unlikely that the full-text indexes will have been created yet.
[ft catalog] Inquire into the progress of the population of the PubsCatalog full-text catalog. The statement:

SELECT FulltextCatalogProperty

( 'PubsCatalog', 'PopulateStatus' )

returns 0 if the service is idle for the full-text catalog and therefore (supposedly) finished and 1 or more to indicate various stages of population.
[testing] Issue some SQL queries to confirm that the administration was done properly. For example, issue the following SQL statement:

SELECT P.pub_name, T.title, T.price

FROM publishers P, titles.T

WHERE P.pub_id = T.pub_id

AND P.country = 'England'

AND CONTAINS (T.notes,

'"case is altered" OR

"cat and custard pot" OR

"the monarch and the sphinx"

' )
[testing] Issue the following SQL query:

SELECT title_id, title, pubdate

FROM titles

WHERE CONTAINS (T.title,

'classic ~ french ~ cooking')

This results in an error because the 'title' column was not registered for full-text queries.
[column] Check to see is there is some mistake. The statement:

SELECT ColumnProperty ( ObjectId('titles'),

'titles',

'IsFulltextIndexed' )

returns 1 if title is part of the full-text index for the books table and 0 if it is not. In this case, the value returned is 0.
[table] List the columns that participate in full-text processing for 'titles' by invoking the stored procedure:

sp_help_fulltext_columns 'titles'

The results of this query will show that there was a mistake and that 'type', instead of 'title', was included in the full-text index definition.
[table] Deactivate the 'titles' table so that the 'title' column can be added to the full-text index and the 'type' column can be removed. This is done by invoking the stored procedure:

sp_fulltext_table 'titles', 'deactivate'

As well as allowing columns to be added and deleted, the effect of deactivating the 'titles' table means that the table no longer participates in the population of the PubsCatalog full-text catalog. However, the metadata remains and the table can be re-activated. The existing full-text index for the 'titles' table remains in place until the next full population of PubsCatalog but it is unused because SQL Server blocks queries on deactivated tables.
[column] Add the 'title' column and remove the 'type' column from the metadata for the 'title' table's full-text index. This is done by invoking the sp_fulltext_column stored procedure once for each column:

sp_fulltext_column 'titles', 'type', 'drop'

sp_fulltext_column 'titles', 'title', 'add'
[table] Activate the books table again by invoking the stored procedure:

sp_fulltext_table 'titles', 'activate'

Note that the table is re-activated and the index is not repopulated. The old index is still available for queries against all the full-text registered columns that remain (but not, of course, against any new full-text registered columns). Another thing to beware of before re-population, is that data from deleted columns will be matched on queries that specify an * search ( i.e., all full-text columns in table).
[ft catalog] Start an incremental population of the PubsCatalog full-text catalog by invoking the stored procedure:

sp_fulltext_catalog 'PubsCatalog',

'start_incremental'

An incremental population will refresh the full-text catalog by indexing data in the full-text enabled columns that meet any the following criteria:
- Data from rows that have been updated or inserted since the last population
  - in tables that have a column of type TIMESTAMP.
  Data from all rows:
  - in tables that do not have a column of type TIMESTAMP.
  - in tables that were enabled for full-text processing since the last population.
  - In tables whose schema has been modified in any way since the last population, data from all rows.
[testing] After waiting for the re-population of PubsCatalog to be completed, re-issue the query from step [testing] Issue the following SQL query:. This time, no error is raised.

We turn now to the graphical user interfaces. They are very simple and straightforward, so we will concentrate on showing how they fit into the rest of GUI for the SQL Server Enterprise Manager . Five of the major "entry points" into full-text administration are shown with large numbered arrows superimposed over the screen fragments that follow. Unlike other parts of this paper, we are not using the numbers to show a sequence of actions, they are just reference points.

A) SQL Server Enterprise Manager's Console Tree:

sqlfts03

There are two full-text objects, on the console tree:

1 Full-Text Search: Right-clicking on this brings up a menu that has options for things like starting and stopping the full-text search service.

2 Full-Text Catalogs: Selecting this causes a list of the database's full-text catalogs to be displayed in the details pane (see screen fragment E).

Right-clicking this results in a context menu that has options for things like creating new full-text catalogs and rebuilding all the databases full-text catalogs.

B) SQL Server Enterprise Manager's Tools Menu:

sqlfts04

The option of interest here is:

3 Full-Text Indexing: selecting this launches the Full-Text Indexing Wizard (see screen fragment H).

C) The Context Menu of a Typical Table:

sqlfts05

The 4 Full-Text Index Table option has suboptions that can be used to lunch the Full-Text Indexing Wizard in order to define or to modify the full-text index specs for a table or to remove full-text indexing from the table.

D) The Property Sheet Tabs of a Typical Table (parts thereof):

sqlfts06

The new tab is:

5 Full-Text Indexing: Selecting this tab selects a property page. The above fragment shows portions of such a typical page.

E) A Typical Full-text Catalog's Catalog Pane and Context Menu:

sqlfts07

This menu was obtained by the following steps:

Selecting the Full-Text Catalogs object in screen fragment A in order to obtain the list of full-text catalogs.
Right-clicking the PubsCatalog item in order to obtain the context menu that is shown in the screen fragment.

Selecting the Properties item results in screen fragment F.

Selecting the Schedules item results in screen fragment G.

F) The Status Property Page for a Typical Full-text Catalog:

sqlfts08

This property page fragment was obtained by selecting the Properties option in the menu shown in screen fragment E.

G) The Schedules Property Page for a Typical Full-text Catalog:

sqlfts09

This property page fragment was obtained by selecting the Schedules tab shown in screen fragment E.

H) The Full-Text Indexing Wizard:

sqlfts10

This wizard may be invoked from many places. The two that we have shown here are by selecting the Full-Text Indexing option from screen fragment B or one of the Full-Text Index Table options from screen fragment C. The wizard gathers all the information necessary to create and maintain a full-text index for a given table.

Remember that registering and activating a table does not mean that the full-text queries can be issued against the table; a full text index must first be created and the wizard does not kick this off. Therefore, after registering one of more tables, it customary to start a full population of that (those) table's full-text catalogs. This can be done by selecting the Start Population option shown in screen fragment E.

Conclusions

We are excited to be able to provide this capability in SQL Server 7.0. The features we support are at entry level, but still serve many very useful full-text searching purposes for our customers.

The fact that even this entry level provides the ability to include data both inside and outside the database is significant. The SQL Server 7.0 Distributed Query Processor enables advanced capability by being able to join full-text query results in the database with results generated by queries against Index Server 2.0 in the file system. We believe that a excellent foundation has been paved during the integration of the Microsoft Information Retrieval technologies into SQL Server 7.0: technologies that have already shipped in several Microsoft products and that will serve as the basis for other Microsoft products that support text search.

Our component reuse strategy benefits the customer by providing seamless upgrade capability in the underlying index and search engine, and cross-product consistency.

1 Within the context of SQL Server, this is also referred to as "full-text search".

2 Also called stop-words in the industry.

3 Also referred to as just Microsoft Search, and as the full-text search.

4 In addition, this defines full-text population start seeds for each table to the Full-Text Service.

authors	UPKCL_auidind
jobs	PK_jobs_22AA996
pub_info	UPKCL_pubinfo
titles	UPKCL_titleind

Extensions to SQL Server to Support Full-Text Search

Abstract

Introduction

Additional resources