Full-Text Indexing and Querying Process
The indexing component of Full-Text Search is responsible for the initial population of the full-text index, and the subsequent update of this index when the data in the full-text indexed tables is modified. The architecture of the full-text gathering mechanism has been improved in Microsoft SQL Server 2005 in order to make the full-text indexing process more efficient, leading to significant performance improvements.
When a full-text population (also known as a crawl) is initiated, the Database Engine pushes large batches of data into memory and tells the Microsoft Full-Text Engine for SQL Server (MSFTESQL) service to begin indexing. The MSFTESQL service indexes character and formatted binary data from a column, or columns of a table. Using a protocol handler component, the full-text engine pulls the data from memory through to be further processed, resulting in a full-text index.
When indexing data stored in a varbinary(max) or image column, the filter, which implements the IFilter interface, extracts text based on the specified file format for that data (for example, Microsoft Word). In some cases, the filter components require the varbinary(max), or image data to be written out to the service account Temp directory, instead of being pushed into memory.
As part of processing, the gathered text data is passed through a word breaker to separate the text into individual tokens, or keywords. The language used for tokenization is specified at the column level, or can be identified within varbinary(max), image, or xml data by the filter component.
Additional processing may be performed to remove noise words, and to normalize tokens before they are stored in the full-text index or an index fragment.
When a population has completed, a final merge process is triggered that merges the index fragments together into one master full-text index. This results in improved query performance since only the master index needs to be queried rather than a number of index fragments, and better scoring statistics may be used for relevance ranking.
|The master merge can be I/O intensive, because large amounts of data must be written and read when index fragments are merged, but it does not block incoming queries.|
A full-text query sent from a client goes to the SQL Server Query Processor (QP) in the SQL Server process. The QP passes this on to the full-text query component, which creates an OLE DB command tree and sends it on to the Microsoft Full-Text Engine for SQL Server (MSFTESQL) service. In the MSFTESQL process, the Full-Text Engine Query Processor processes the query using thesaurus and noise word files, as well as word breakers and stemmers. After processing this query, the MSFTESQL service returns a result set to the SQL Server process. This result set can be used for further processing or returned to the client.