Chapter 23 - Textual Searches on File Data

Microsoft SQL Server version 7.0 introduces facilities that support textual queries on data in SQL Server as well as on data in the file system. This chapter describes searches on data in the file system. Several products and features have been brought together to support this capability, including SQL Server distributed queries, Windows NT Server built-in Web server, Microsoft Internet Information Services (IIS) version 4.0, and Microsoft Index Services version 2.0. This chapter is for those who are familiar with SQL Server but not necessarily familiar with its textual search features and products.

This chapter introduces the types of textual searches supported by SQL Server and illustrates the roles performed by both the IIS and Index Services technologies. It also introduces SQL Server distributed queries and describes their use in the processing of textual searches. It then addresses the question of how SQL queries can be written against the file system when there are no tables in the file system. Finally, the SQL extensions to support full-text queries are described, and several examples, including examples that combine file data with database data, are provided.

This chapter provides an overview of how to incorporate file data into SQL queries and of how the various components of SQL Server and other software interact to provide support for such queries.

A large portion of digitally stored information is in the form of unstructured data, primarily text, which is stored in the file system. This data is often related to data within the database, and there are requirements to support searches that include both sources. However, it is often inappropriate to import this data into the file system. SQL Server 7.0 distributed queries, coupled with extensions to the SQL language, make it possible to write such queries without the data. This capability is called file content search.

There are two major types of textual searches:

Property

This search technology first applies filters to documents to extract properties such as author, subject, type, word count, printed page count, and time last written, and then issues searches against those properties.

Full-text

This search technology first creates indexes of all non-noise words in the documents, and then uses these indexes to support linguistic searches and proximity searches (searches for words or phrases that are "close" to each other).

File content search supports both these types of textual searches and couples them with the ability to incorporate such searches into a query that includes relational operations against database data. For example, the following search selects the names, sizes, and authors of all Microsoft Word files on drive D that contain the phrase "SQL Server" in proximity to the word "text." It then joins this result with the writers table to obtain the author's citizenship.

SELECT Q.FileName, Q.Size, Q.DocAuthor, W.Citizenship 
FROM OpenQuery(MyLinkedServer, 
'SELECT FileName, Size, DocAuthor
FROM SCOPE('' "D:\" '')
WHERE CONTAINS(''"SQL Server"
NEAR() text'')
AND FileName LIKE ''%.doc%'' '
) AS Q,
writers AS W
WHERE Q.DocAuthor = W.writer_name

File content search relies on the Microsoft OLE DB Provider for Index Services. It also relies on Index Services for support of underlying filters and full-text indexes.

Notice that the OLE DB Provider gives Index Services 2.0 the ability to support SQL queries against data in the file system independent of SQL Server. The core extensions SQL that support such queries are the same in Index Services and SQL Server.

This chapter has the following purposes:

  • To illustrate the use of IIS 4.0, Index Services 2.0, and SQL Server 7.0 in support of file content search 

  • To introduce the SQL language extensions in support of textual searches on file data 

  • To present several sample queries to encourage you to start using textual searches 

Only a small percentage of the facilities available in the supporting products are introduced in this chapter, and even when a facility is discussed, many of its options are not discussed. For more information, see SQL Server Books Online and your Windows NT documentation.

Internet Information Services and Index Services

Microsoft Internet Information Server (IIS) 4.0 and Index Services 2.0 (both part of the Microsoft Windows NT 4.0 Option Pack) combine to provide property filtering and searching as well as full-text indexing and searching of file data. Windows NT 4.01 Service Pack 4 must be installed for proper interaction between Index Services and SQL Server.

All of these capabilities are available completely independent of SQL Server. In particular, there are at least two ways to search that do not use SQL Server. One of these employs an Index Services specific query language; the other supports SQL-based queries within ActiveX Data Objects (ADO). Neither alternative is discussed in this chapter except to say that SQL used in ADO queries is consistent with the SQL extensions outlined here. This chapter discusses property filtering and full-text indexing.

Index Services provides filters for several file formats, including Microsoft Word, Microsoft PowerPoint, Microsoft Excel, and HTML. Filters are also available for plain–text documents. Filters can be written by customers and third-party vendors for other formats such as Adobe Acrobat. Filters provide support for non-plain–text documents and capture property values both from the file content and about the files. Assuming that every file is a document, examples of properties include the document's title, the number of its pages with notes (for PowerPoint documents), the number of paragraphs it contains, the date and time when it was last accessed, and its physical path. A full list of file properties is provided later in this chapter. For more information, see your Windows NT documentation.

Full-text indexes are created by scanning file content. The process consists of tracking which significant words are used and where they are located. For example, a full-text index might indicate that the word "Canada" is found at word number 227, word number 473, and word number 1,017 in a given file. This index structure supports an efficient search for all items containing indexed words, as well as advanced search operations such as phrase searches and proximity searches. An example of a phrase search is looking for "white elephant," where "white" is immediately followed by "elephant." An example of a proximity search is looking for "big," where "big" occurs near "house."

To prevent the full-text index from becoming bloated, noise words (words that are too common to expedite the search, such as "a," "and," "the," and "therefore")* *are ignored. SQL Server setup for version 7.0 installs noise-word lists for many languages in directory \Mssql\Ftdata\Sqlserver\Config. And the set of supported languages is growing. The choice of a particular noise-word list is based on the language of the material that is file-format dependent during the filtering process. Some files set the language per section or paragraph; some specify it for the entire document. These noise-word lists should be sufficient for most operations, but they can be modified.

Index Services stores indexes and property values in a text search catalog. By default, a text search catalog named Web is created when Index Services is installed. It is possible to specify more than one text search catalog, but this chapter confines itself to the use of Web and does not discuss the process used to create additional text search catalogs.

A given text search catalog references one or more IIS virtual directories. A virtual directory references one or more physical directories and, optionally, other virtual directories. When a real file is linked to the text search catalog by means of a virtual directory, Index Services is notified of any new files that need to be indexed and begins filtering and indexing the properties and content associated with those files. Index Services is also notified of any subsequent changes to the files and will refilter and reindex any updated files.* *

The following screen shot shows a directory for the default Web site on the computer Pellow-2. Notice that /Corpus is listed in both panes. /Corpus is the alias of a virtual directory that, in turn, points to the real directory, D:\Corpus. All of the files in D:\Corpus have their properties and full-text indexes maintained in the Web text search catalog.

Cc966481.sqc24001(en-us,TechNet.10).gif

The following screen shot demonstrates how the Virtual Directory Wizard can be invoked to insert new virtual directories into the tree.

Cc966481.sqc24002(en-us,TechNet.10).gif

This final screen shot demonstrates the result of using the Virtual Directory Wizard to add the /SQL_standards virtual directory, which contains two virtual directories and one real directory.

Cc966481.sqc24003(en-us,TechNet.10).gif

SQL Server Distributed Queries

SQL Server 7.0 supports access to data in multiple, heterogeneous data sources, which can be on either the same or different computers. The data can be stored in various relational and nonrelational data sources for which there is either an OLE DB provider or ODBC driver. The OLE DB provider exposes its data in tabular objects called rowsets. SQL Server 7.0 allows a rowset from an OLE DB provider to be referenced in the FROM clause of a SQL query as if it were a SQL Server table.

OLE DB Provider for Index Services

In the examples discussed in this chapter, OLE DB provider is supplied by Index Services.

The sp_addlinkedserver stored procedure may be used to register data sources that will be referenced in distributed queries. To register OLE DB Provider for Index Services for the Web text search catalog on the computer on which SQL Server is running, execute this statement:

EXECUTE sp_AddLinkedServer FileSystem, 
'Index Services', 
'MSIDXS', 
'Web'

Here are the definitions of the syntax terminology:

FileSystem

The linked_server_name assigned to this particular linked server.

Index Services

The product_name of the data source.

MSIDXS

The provider_name (PROGID) of OLE DB Provider for Index Services.

Web

The name of the text search catalog that will be used for this linked server.

The OLE DB provider now can be referenced using the FileSystem linked_server_name in the new OPENQUERY() result-set–valued function. For example:

SELECT *
FROM OpenQuery(FileSystem,
'SELECT Directory, FileName, DocAuthor, Size, Create
FROM SCOPE()
WHERE CONTAINS( Contents, ''Distributed'' ) ' )

Notice that there are two SELECT statements. The inner SELECT statement (within the OPENQUERY() function) returns a result set as a table that can then be used like any other table in the FROM clause. In this case, the outer SELECT statement is a simple SELECT *, which passes on all the rows from the inner SELECT statement. Also notice that because the inner SELECT statement is specified as a constant parameter value within single quotation marks, all single quotation characters within the inner SELECT statement must be doubled. That is why ' 'Distributed' ' appears as it does. Two adjacent single quote marks (' ') are not the same as one double quotation mark (").

In general, SQL Server distributed queries support both read and update access to the data source. In the case of Index Services, only read access is appropriate. Generally, distribution is supported to remote computers; however, the SQL Server file content search feature has been tested only with all components and all data residing in the same computer.

Notice that OPENQUERY() does not work when running with a compatibility mode earlier than SQL Server 7.0. The compatibility mode can be set using the sp_dbcmptlevel stored procedure.

Security

There are special security considerations with OLE DB Provider for Index Services on the Windows NT operating system. SQL Server supplies a username and password on the current SQL Server login and on the login mapping set up in SQL Server of the form (current login, linked server) -> (remote login, remote password). However, OLE DB Provider for Index Services ignores the username and password and instead uses the Windows NT security context of the client (as if the client asked for a Windows NT Authentication mode connection). This means that OLE DB Provider for Index Services uses the Windows NT account under which SQL Server is running. Because this account is likely to be powerful, it can expose information about files to which the original SQL Server login has no privileges.

This concern has been addressed by giving SQL Server administrators full control over who has access to OLE DB Provider for Index Services through SQL Server. The administrator can control the login mappings so that no one other than those who have explicit login mappings can gain access to the server (for example, an Index Services linked server). The administrator can also disable ad hoc access against a given provider so that no one can access Index Services through the ad hoc route without using a linked server.

For example, if SQL Server is running under the Windows NT account sqlaccount, consider a linked server called mytextfilesthat has been configured to point to a particular Index Services text search catalog. On Windows NT 4.0, when a SQL Server user executes a distributed query against mytextfiles, this query is executed under the privileges of the Windows NT account under which SQL Server is running (sqlaccount).* Given this, the SQL Server security administrator must decide which SQL Server logins should have access to mytextfiles. This can be done by performing the following steps.

When a linked server is created with the sp_addlinkedserver stored procedure, by default all SQL Server logins have access to that server through self-mapping. The default mapping needs to be removed so that no one other than the approved logins can gain access to mytextfiles. This is accomplished by executing the following procedure:

-- Remove login mappings for all logins (by specifying NULL value for the @locallogin
parameter)
exec sp_droplinkedsrvlogin 'MyTextFiles', NULL

For each SQL Server login (either SQL Server Authentication or Windows NT Authentication) to be given access to the mytextfiles linked server, execute the following stored procedure:

-- Add a self mapping for local login to itself
exec sp_addlinkedsrvlogin 'MyTextFiles, true, 'local_login'

Remove the ability for ad hoc access to Index Services text search catalogs using the OPENROWSET() function. This task is accomplished through SQL Server Enterprise Manager by using the Linked Server Properties dialog box, which can be brought up by right-clicking on the corresponding linked server in SQL Server Enterprise Manager, and then clicking Properties.

Defined Tables

At least one defined table must be specified in every SQL query. A defined table is a table in which the number and types of columns are either known in advance or specified as part of the query. A relational database usually contains a number of defined tables, and metadata about the columns of these tables is stored in a schema.

The collection of files in a file system does not generally have a predefined structure. The closest thing to columns are the properties of a file, but there is no deterministic set of properties for files. The closest thing to a row is a file, but files are usually not grouped in a homogeneous collection akin to rows in a table. Thus, in this case, the table concept is unclear, SELECT * is meaningless, and both the rows and columns are unbounded. Another way of looking at this is that a file system effectively has a universal schema consisting of every possible file property, both known and unknown.

Index Services solves this problem by providing the SCOPE function as a means of defining the set of rows that makes up a virtual table and by providing file properties that substitute for columns.

SCOPE Function

The SCOPE function is specified in the FROM clause of the Index Services query. It specifies the set of files that make up a virtual table.

The syntax of the SCOPE function, simplified for this chapter, is as follows:

Cc966481.sqc24004(en-us,TechNet.10).gif

These are the definitions of the syntax terminology:

()

The virtual table consists of all the files that have been registered in the text search catalog and data source for the linked server specified in the OPENQUERY() function.

DEEP TRAVERSAL OF

The virtual table consists of all the files in the directory at the specified path or virtual directory as well as all the files in all the subdirectories (to any level) that are considered to be part of the virtual table. DEEP is the default.

SHALLOW TRAVERSAL OF

The virtual table consists only of the files in the top-level directory at the specified path or virtual directory that are considered to be part of the virtual table.

physical_path

A path to a real directory. If a real directory is specified, the filtering and indexing is done as part of the query processing, which can be time-consuming.

virtual_directory

The alias (or chain of aliases) assigned to a virtual directory that has been registered in the text search catalog and data source for the linked server specified in the OPENQUERY() function. In this case, the filtering and indexing will probably already have been done and, thus, the query will be much faster than when a physical path is specified.

File Properties

Index Services filters and maintains an excess of 50 file properties. All of these can be specified in text file search queries. From the perspective of writing a SELECT statement, there are three types of file properties:

  • Those that can be specified only in a WHERE clause

  • Those that can be specified in a WHERE clause and an ORDER BY clause. 

  • Those that can be specified in a WHERE clause and a select list. 

This table outlines some of the file properties.

Property name

SQL Data Type

Description

Use in ORDER BY clause

Use in select list

Access

datetime

Most recent date and time that the file was accessed.

Yes

Yes

Characterization

nvarchar or ntext

Abstract of the contents of the file. In Index Services 2.0, this is usually the first paragraph or first section of a document. In future releases, it is planned to be a real summary.

Yes

Contents

nvarchar or ntext

Main contents of the file.

Create

datetime

Date and time that the file was created.

Yes

Yes

Directory

nvarchar

Physical path to the file, not including the file name.

Yes

Yes

DocAuthor

nvarchar

Document author.

Yes

Yes

DocComments

nvarchar

Comments about the document.

Yes

Yes

DocLastAuthor

nvarchar

Most recent user that edited the document.

Yes

Yes

DocLastPrinted

datetime

Date and time that the document was last printed.

Yes

DocPageCount

integer

Number of pages in the document.

Yes

DocPartTitles

array of varchar

Names of the document parts:
• in Microsoft PowerPoint (slide titles).
• in Microsoft Excel (spreadsheets).
• in Microsoft Word (documents).

DocSubject

nvarchar

Subject of the document.

Yes

Yes

DocTitle

nvarchar

Title of the document.

Yes

Yes

DocWordCount

integer

Number of words in the document.

Yes

FileIndex

decimal (19,0)

Unique identifier of the file.

Yes

Yes

FileName

nvarchar

Name of the file.

Yes

Yes

HitCount

integer

Number of words matching the query.

Yes

Yes

Path

nvarchar

Full physical path to the file, including the file name.

Yes

Yes

Rank

integer

Value from 0 through 1,000, indicating how well this row matches the selection criteria.

Yes

Yes

Size

decimal (19, 0)

Size of the file (in bytes).

Yes

Yes

Write

datetime

Most recent date and time that the file was written.

Yes

Yes

Customers and third-party vendors can write filters to add to this set of file properties. They can also add properties, for example, by adding tags to an HTML document. In addition, to permit the query and retrieval of such user-defined file properties, the SQL extensions to Index Services include support for a SET statement that allows the specification of new file property names and their associated types.

Virtual Tables

It is possible to specify a query with the equivalent of a table in the file system, resulting in the select list and the FROM clause. For other parts of a SELECT statement, the properties can be used in place of columns in the WHERE and ORDER BY clauses. However, the GROUP BY and HAVING clauses are not supported by OLE DB Provider for Index Services. The following examples illustrate the use of all supported clauses.

The following query selects the full physical path and the file creation timestamp of all files in the /SQL-standards virtual directory and all its subdirectories, where the document contains the phrase "overloaded function."

SELECT * 
FROM OpenQuery(FileSystem, 
'SELECT Path, Create
FROM SCOPE('' "/SQL-standards" '')
WHERE CONTAINS(Contents, '' "overloaded function" '') 
'
) 

The next query is similar to the previous one, except that only files directly in the /SQL-standards virtual directory are considered.

SELECT * 
FROM OpenQuery(FileSystem, 
'SELECT Path, Create
FROM SCOPE('' SHALLOW TRAVERSAL OF "/SQL-standards" '')
WHERE CONTAINS(Contents, '' "overloaded function" '') 
' 
) 

The next query is also similar, except that only files directly in the /SQL3 virtual subdirectory are considered.

SELECT * 
FROM OpenQuery(FileSystem, 
'SELECT Path, Create
FROM SCOPE('' "/SQL-standards/SQL3" '')
WHERE CONTAINS(Contents, '' "overloaded function" '') 
' 
) 

The final query selects author, title, subject, and file name of documents in all files that are either in the /Corpus virtual directory and its subdirectories or in the \Temp directory on drive C, where the document is at least 5,000 words, the author is either Wendy Vasse or Anas Abbar, and the rows representing those documents with the most pages are ordered highest.

SELECT * 
FROM OpenQuery(FileSystem, 
'SELECT DocAuthor, DocTitle, DocSubject, FileName 
FROM SCOPE('' "/corpus" '',
'' "C:\temp" '' )
WHERE DocWordCount >= 5000 AND
( DocAuthor = ''Wendy Vasse'' OR
DocAuthor = ''Anas Abbar'' )
ORDER BY DocPageCount DESC 
' 
) 
Path Through the Components

The following figure contains the code for a typical query and a diagram that illustrates the part played by each component in processing the query.

Cc966481.sqc24005(en-us,TechNet.10).gif

These items relate to the numbered items in the previous diagram.

  1. The query is submitted to SQL Server, which separates the distributed portion of the query (the OPENQUERY() function) and forwards it to the SQL Server Distributed Query Handler. 

  2. The Distributed Query Handler passes the inner SELECT statement specified in the OPENQUERY() function to the to the OLE DB Provider for Index Services (MSIDXS) that has been linked to FileSystem

  3. MSIDXS parses the inner SELECT statement, then issues the appropriate commands to Index Services. 

  4. Index Services forms a virtual table by combining the files in the Web text search catalog's /Corpus virtual directory with the files in the C:\Temp path, selecting only files that contain 5,000 words or more. The results are returned as a rowset to MSIDXS. 

  5. MSIDXS returns the rowset to the Distributed Query Handler. 

  6. The Distributed Query Handler returns the rowset as a table to its calling component within SQL Server. This component, in turn, fulfills the rest of the query by joining this table with the writers table, and then returns the result set to the originator of the query. 

SQL Extensions for Index Services Full-Text Queries

SQL extensions for Index Services are consistent with the SQL supported for full-text search against relational database data. Furthermore, SQL support for full-text searching follows the SQL-3 functional methodology for full-text syntax extensions.

The primary SQL extension consists of the CONTAINS and FREETEXT predicates. These predicates are used to find column values that match special full-text query criteria.

To be consistent with similar features in other products and to make these predicates more extensible, functional notation is used. The high-level syntax is as follows:

Cc966481.sqc24006(en-us,TechNet.10).gif

The flexibility of the functional style of these two predicates allows easy, upward-compatible future extensions for a third parameter to designate the language used for the query. 

CONTAINS Predicate

The CONTAINS predicate determines whether the content of files contains certain words and phrases. The CONTAINS predicate syntax is as follows:

Cc966481.sqc24007(en-us,TechNet.10).gif 

property

The property to be searched. Its data type must be character-based. If property is not specified, the CONTENTS property is assumed. The value of the CONTENTS clause is the contents of the file after conversion to plain text (if necessary) by a filter. It is good practice to explicitly code the CONTENTS clause, rather than to accept it as a default.

simple_term

The term used to match the exact word or phrase being searched for. Its syntax is:

Cc966481.sqc2407a(en-us,TechNet.10).gif 

where word refers to one or more characters without spaces or punctuation and phrase refers to multiple words with spaces in between. Asian languages can have phrases made up of multiple words without any spaces in between.

Here are some examples of simple terms used in the context of the CONTAINS predicate in a WHERE clause:

WHERE CONTAINS( Contents, 'hockey' )
WHERE CONTAINS( Contents, ' "ice hockey" ')

In keeping with the standard for full-text products, the search for characters in a word or phrase is always case-insensitive, and noise words are not stored in the full-text index. For example, suppose there is one file with a value of "This is a dissertation on the use of ice-cream sandwiches as hockey pucks" and another file with the value "Dissertation on new ways of splitting the atom." Since "this," "is," "a," and so on are noise words, they are not stored in the full-text index, and the following two queries are effectively identical:

CONTAINS( Contents, ' "this is a dissertation" ' )
CONTAINS( Contents, 'dissertation' )

Both rows are returned as hits because in the first query, the noise words are removed before processing the query.

Combining Terms Using Boolean Operators

As with other SQL search conditions, more complex conditions can be specified by linking individual operands with the Boolean operators AND, OR, and AND NOT. In this case, the operands are any of the types of terms being discussed (simple_term, prefix_term, and so on). Except for the restriction that the OR NOT combination is not supported and that NOT cannot be specified before the first term, the rules are exactly the same as those used to combine individual predicates to form search conditions. For example, parentheses may be used to change the default priority order in which the operators are applied.

Here are some examples of simple terms being combined within a CONTAINS predicate in a WHERE clause.

WHERE CONTAINS( Contents, 'hockey OR curling' )
WHERE CONTAINS( Contents, 'hockey AND NOT field')
WHERE CONTAINS( Contents, ' ("ice hockey" OR curling) AND NOT Canada ')
prefix_term

prefix_term is used to match words or phrases that begin with specified text. Its syntax is:

Cc966481.sqc24008(en-us,TechNet.10).gif 

A prefix term consists of a simple term appended with an asterisk (*) to activate prefix matching on a word or phrase. All text that starts with the material before the * is matched. The wildcard symbol (*) in this case is similar to the percent symbol (%) in the LIKE predicate in that it matches zero, one, or more characters of the root words in a word or phrase. In a phrase, each word within the phrase is considered to be a prefix; for example, the term "local bus*" matches "locality busy," "local bush," and "locale bust."

Here are two examples of prefix terms used in the context of the CONTAINS predicate in a WHERE clause. The first matches values that contain the word "atom," "atomic," "atomism," "atomy," and so on. The second matches values that contain the word "wine," "vine," or words such as "winery," "wines," "vineyard," or "vinegar."

WHERE CONTAINS( Contents, ' "atom*" ' )
WHERE CONTAINS( Contents, ' "wine*" OR "vine*" ') 
proximity_term

proximity_term is used when the words or phrases being searched for must be close to one another. Its syntax is:

Cc966481.sqc24009(en-us,TechNet.10).gif 

A proximity term is similar to an AND operator in that more than one word or phrase must exist in the value being searched. It differs from AND because the relevance of the match increases as the words appear closer together.

The syntax is designed to be extensible for possible future support for specification of units of proximity such as words, sentences, paragraphs, chapters, and so on.

NEAR, NEAR() and ~ share the same meaning: the first word or phrase is close to the second word or phrase. "Close" is a purposefully vague term that can mean "within 50 words," but the algorithm is complicated. While words within the same sentence are one word distance apart, larger distances are assigned between units such as sentences, paragraphs, and chapters. Even if words or phrases are very far apart, the query is still considered to be satisfied; the row just has a low (zero) rank value. This can be avoided by specifying RANK > 0 as one of the predicates in the WHERE clause.

It is possible to chain-code the proximity matching. For example," a ~ b ~ c " means that a should be near b, which should be near c. Because of the fuzzy nature of full-text searches, it is often desirable to see the rank values. This can be done by including the RANK property in the select list of the query.

Here are some examples of* *proximity terms used in the context of the CONTAINS predicate in a WHERE clause:

WHERE CONTAINS( Contents, ' hockey ~ player ' )

This matches values that contain the word "hockey" in proximity to the word "player."

WHERE CONTAINS( Contents, ' hockey ~ "play*" ') 

This matches values that contain the word "hockey" in proximity to a word that starts with "play."

WHERE CONTAINS( Contents, ' "great*" ~ "Mike Nash" ') AND Rank > 0

This matches values that contain words starting with "great" in proximity to the phrase "Mike Nash." Values that meet the criteria but have a ranking of 0 do not have rows returned.

generation_term

generation_term is used when the words being searched for need to be expanded to include the variants of the original word. Its syntax is:

Cc966481.sqc24010(en-us,TechNet.10).gif 

The INFLECTIONAL predicate means that plural and singular forms of nouns and the various tenses of verbs will be matched. A single term will not match both exclusive noun and exclusive verb forms. The syntax is designed to be extensible enough to handle other linguistically generated forms, such as derivational, soundex, and thesaurus.

Here is an example of a generation term used in the context of the CONTAINS predicate in a WHERE clause.

WHERE CONTAINS(' FORMSOF (INFLECTIONAL, skate) ' )

This matches values that contain words such as "skate," "skates," "skated," and "skating."

weighted_term

weighted_term is used for queries that match a list of words and phrases, each optionally given its own weighting. Matching values must match only one element in the list. Its syntax is as follows, where n.nnn represents a decimal constant from zero through one.

Cc966481.sqc24011(en-us,TechNet.10).gif

A row is returned if there is a match on any one of the ISABOUT elements.

Each component in the vector can be optionally weighted. The assigned weight forces a different measurement of the rank value that is assigned to each row that matches the query.

Here are some examples of weighted terms used in the context of the CONTAINS predicate in a WHERE clause.

WHERE CONTAINS( Contents, ' ISABOUT(hockey, puck, goalie) ' )

This matches article values that contain any of the words "hockey," "puck," or "goalie." The better matches will contain more than one of the words.

WHERE CONTAINS( Contents, 'ISABOUT("Canadian ice hockey" WEIGHT(1.0),
"ice hockey" WEIGHT(.5),
hockey WEIGHT(.2) ) 
' )

This matches article values that may have information about Canadian ice hockey, with higher rank values assigned to articles that have more words from the phrase.

FREETEXT Predicate

The FREETEXT predicate determines whether or not a value reflects the meaning, rather than the exact words, specified in the predicate.

The FREETEXT predicate syntax is as follows:

Cc966481.sqc24012(en-us,TechNet.10).gif

This is a simple form of natural language query, where the index engine internally breaks the freetext string into a number of search terms, generates the stemmed form of the words, assigns heuristic weighting to each term, then finds the matches.

Here is an example of a FREETEXT predicate used in a WHERE clause.

WHERE FREETEXT( Contents, ' Who have been the most valuable ice hockey
players from 1975 through 1982? ' ) 

Search Condition Support in Index Services

The search_condition supported by Index Services is slightly different from the search_condition  supported by SQL Server. Because these queries are distributed to the OLE DB provider for processing, the queries must follow the rules of the provider. The main difference is that OLE DB Provider for Index Services does not support the QUANTIFIED COMPARISON, BETWEEN, EXISTS, IN, or NULL predicates, but it does support two other predicates: MATCHES and ARRAY COMPARISON. These predicates are not yet directly supported by SQL Server.

Following is an introduction to the search_condition syntax as supported by Index Services. Some aspects have been omitted, and the syntax of other aspects is incomplete. For more information, see the Index Services documentation.

The search_condition syntax is as follows:

Cc966481.sqc24013(en-us,TechNet.10).gif

Cc966481.sqc24014(en-us,TechNet.10).gif 

MATCHES and ARRAY COMPARISON Predicates

This section briefly introduces the MATCHES and ARRAY COMPARISON predicates. For more information, see your Windows NT documentation.

The MATCHES predicate provides more extensive pattern matching than the LIKE predicate. This added functionality bears the burden of a more complicated set of rules. The complete syntax and rules are not described here; rather, here are some examples to illustrate the use of this predicate.

Here is a grouped match against more than one pattern, where it is known that the author's first name is Peggy, but the spelling of her second name is uncertain.

WHERE MATCHES( DocAuthor, 'Peggy |(MacK|,McK|,MacC|,McC|)arson' ) 
Here, it is uncertain if Pellow is spelled with one "l" or two.
WHERE MATCHES( DocAuthor, '* Pel|{1,2|}ow' ) 

The ARRAY COMPARISON predicate is for use with the Index Services Vector properties. Some of the property values filtered by Index Services are multivalued. The data type of such values is a variable size array. SQL Server does not yet support such data types, but SQL-3 does. The SQL extension supported by OLE DB Provider for Index Services is consistent with that in SQL-3. For example, the virtual table contains a number of PowerPoint presentations, and you want to know the path to the presentations that contain any slide called "CONTAINS predicate," "FREETEXT predicate," or "Query Transformation."

SELECT * 
FROM OpenQuery(FileSystem,
'SELECT Path
FROM SCOPE('' "/slide_presentations" '')
WHERE DocPartTitles = SOME ARRAY[ ''CONTAINS predicate'',
''FREETEXT predicate'',
''Query Transformation'' ] 
') 

For more information, see your Windows NT documentation.

Sample Full-Text Queries

This section provides several examples that answer queries by combining database data and file data.

The following query returns the title and publication year of qualifying books that are represented by files in the virtual directory that has the /Pubs alias. In order to qualify, a book must cost less than $20.00, and text in the Characterization property must indicate that the book is about ice hockey. It is known that the year portion of the Create property is always the publication year of the book. The customer has defined the BookCost property (of type money), which filters out the cost of each book.

SELECT Q.DocTitle, DATEPART(year, Q.Create) 
FROM OpenQuery(FileSystem,
'SELECT DocTitle, Create
FROM SCOPE('' "/pubs" '')
WHERE BookCost <= 20.00
AND CONTAINS( Characterization, '' "ice hockey" '' )
') AS Q 

Notice that the table alias value of Q has been assigned to the table returned by the OPENQUERY() function. This alias is then used to qualify the items in the outer select list. Previous examples specified the SELECT * statement and passed on all values returned by the inner SELECT statement. Here, the SQL Server DATEPART() function is used to pass on only the year portion of the create datetime value.

The following query returns the same information as the previous query. The difference is that the price of the book is obtained from the document_cost column in the BookCost table in the database, rather than from a property in the file system. The primary key of the BookCost table is the combination of the document_author and document_title columns.

SELECT Q.DocTitle, DATEPART(year, Q.Create) 
FROM OpenQuery(FileSystem,
'SELECT DocTitle, Create, DocAuthor, DocTitle 
FROM SCOPE('' "/pubs" '')
AND CONTAINS( Characterization, '' "ice hockey" '' )
') AS Q, 
BookCost as B
WHERE Q.DocAuthor = B.document_author
AND Q.DocTitle = B.document_title
AND B.document_cost <= 20.00 

The table returned by the OPENQUERY() function is joined to the real BookCost table in the database, then rows with a suitable cost are filtered for inclusion in the outer SELECT statement.

The next query also joins data from the file system and the database, and this time, data from both appears in the outer SELECT list. Furthermore, the Rank property, which indicates how well the selected rows met the selection criteria, appears in the select list and is used to ensure that higher-ranking rows appear before lower-ranking rows in the outer SELECT statement. In this example, the wording on the plaques in the Hockey Hall of Fame is recorded on files. There is a file for each plaque, and the plaque number can be obtained through the DocSubject property. The HockeyHall table contains PlaqueNo, PlayerName, StartYear, and LastYear columns, with the primary key in the PlaqueNo column. You want to return the PlayerName and PlaqueNo columns from the table and the Rank and DocComments properties from the file. Only players who might have played for Canadian or U.S. teams in the early 1900s are to be returned.

SELECT HH.PlayerName, HH.PlaqueNo, Q.Rank, Q.DocComments
FROM OpenQuery(FileSystem,
'SELECT DocSubject, DocComments, Rank 
FROM SCOPE('' "/hall_of_fame" '')
WHERE CONTAINS( Contents, '' Canada OR "United States" 
'' )
') AS Q, 
HockeyHall as HH
WHERE Q.DocSubject = HH.PlaqueNo
AND HH.StartYear < 1915
AND HH.EndYear < 1899
ORDER BY Q.Rank DESC 

In the next example, an international construction company stores a large number of onsite progress reports in a Microsoft Word document in a central site. All the documents have been registered within the /Site_report virtual directory. Each document can be identified by its unique FileIndex property. These documents are tracked in the database. The following tables in the database are of interest.

Projects: project_number char(8) primary key,
project_name nvarchar(40),
project_leader smallint, --employee number
budgeted money,
spent money,
...
Employees: employee_number smallint primary key,
employee_name nvarchar(40),
nationality nvarchar(20),
...
Reports: project_number char(8),
file_index decimal(19,0), --link to the file
...
primary key is project_number and file_index

A rush order has been issued for 5,000 saunas in a heavily forested area with no electricity. The salesperson vaguely recalls a similar, successful project about 10 years ago. He issues a query that returns the project number, the paths to the onsite reports, and the ranking value of the projects managed by someone from a Scandinavian country/region that came in under budget between 8 and 12 years ago.

The highest ranking is given if the onsite report contains the phrase "wood burning" in proximity to "sauna." Points are also given if the document contains the phrase "Northern Ontario" or the word "island." Here is an example of what the query could look like:

SELECT P.project_number, Q.path, Q.Rank
FROM OpenQuery(FileSystem,
'SELECT FileIndex, Path, Rank, Write 
FROM SCOPE('' "/site_reports" '')
WHERE CONTAINS( Contents, 
'' ISABOUT( Sauna ~ "wood burning" WEIGHT (.9),
"Northern Ontario" WEIGHT (.4),
Ontario WEIGHT (.2),
island WEIGHT (.2) )
'' )
AND Rank > 5
') AS Q, 
Projects AS P,
Employees AS E, 
Reports AS R
WHERE Q.FileIndex = R.file_index
AND R.project_number = P.project_number
AND P.project_leader = E.employee_number
AND E.nationality IN ('FINNISH', 'DANISH', 'SWEDISH', 'NORWEGIAN')
AND P.spent < P.budgeted
AND YEAR(Q.Write) > 1986 AND YEAR(Q.Write) < 1992

SQL Extensions for Site Server Full-Text Queries

Microsoft Site Server version 3.0 also ships an OLE DB provider, the primary purpose of which is to allow users to write ADO application programs to query Web data. The OLE DB Provider for Site Server has not yet been tested with the SQL Server 7.0 query processor for distributed queries, and there is no official support for interoperability with SQL Server 7.0. However, users who want to experiment with this configuration will find the following information useful.

As with the OLE DB Provider for Index Services, the sp_addlinkedserver stored procedure is used to register OLE DB Provider for Site Server. For example, to register this provider for the WebTest text search catalog on the same computer that SQL Server is running on, the following statement must be executed:

EXECUTE sp_addlinkedserver WebData, 'Site Server', 'MSSEARCHSQL', 'WebTest'

These are the definitions of the syntax terminology:

WebData

The linked server name assigned to this particular linked sever.

Site Server

The product name of the data source.

MSSEARCHSQL

The provider name (PROGID) of OLE DB Provider for Site Server.

WebTest

The name of the text search catalog that will be used for this linked server.

Notice that Site Server has additional syntax to support the union of results generated by a query across multiple catalogs. For more information, see your Site Server Search documentation.