How word breakers, stemmers, and noise word files affect search results (Office SharePoint Server 2007)

Applies To: Office SharePoint Server 2007

This Office product will reach end of support on October 10, 2017. To stay supported, you will need to upgrade. For more information, see Resources to help you upgrade your Office 2007 servers and clients.

Word breakers, stemmers, and noise word files (also known as stop word files) are components of the indexing and querying processes.

In this article:

  • Word breakers

  • Stemmers

  • Noise word files

Word breakers

A word breaker is a component that is used to break strings of text into individual words during the indexing and querying processes. During the indexing process, text is extracted from content items as an unbroken string of characters. Word breakers reestablish where each word in the string of characters begins and ends. Additionally, word breakers separate compound words so that users receive a query result on a portion of the original compound word and also on the individual terms that compose the compound word. Word breakers also convert numbers and dates from content items into a standard form.
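The word-breaking behavior described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the component that Office SharePoint Server 2007 uses: the function name break_words is hypothetical, compound words are assumed to be hyphenated, and number and date normalization is left out.

```python
import re

def break_words(text):
    """Toy word breaker: return each word, plus the parts of hyphenated compounds."""
    terms = []
    # A token is a run of letters/digits, optionally joined by single hyphens.
    for token in re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text):
        token = token.lower()
        terms.append(token)                  # keep the original (possibly compound) word
        if "-" in token:
            terms.extend(token.split("-"))   # also keep the individual terms
    return terms

print(break_words("Use a high-performance index"))
```

Because both the compound and its parts are kept, a query for "performance" or for "high-performance" could match the same content item.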

Each language has a different word breaker. The indexing engine determines which word breaker to use and, if more than one language is detected, can use more than one word breaker for text that comes from a single document. If a word breaker does not exist for a particular language, the neutral word breaker is used.
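As a rough sketch of the fallback rule, the lookup below returns a language-specific breaker when one is registered and the neutral breaker otherwise. The registry WORD_BREAKERS and the functions english_breaker, neutral_breaker, and select_breaker are hypothetical names introduced for illustration; they do not correspond to any SharePoint API.

```python
def neutral_breaker(text):
    """Language-neutral fallback: split on whitespace only."""
    return text.split()

def english_breaker(text):
    """Stand-in for a language-specific breaker: also splits hyphenated compounds."""
    return text.replace("-", " ").split()

# Hypothetical registry that maps a language name to its word breaker.
WORD_BREAKERS = {"english": english_breaker}

def select_breaker(language):
    """Return the breaker for the language, or the neutral breaker if none exists."""
    return WORD_BREAKERS.get(language.lower(), neutral_breaker)

breaker = select_breaker("English")
print(breaker("full-text search"))
```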

Word breakers are also used by the query engine. When a user submits a query, a word breaker is used to break apart compound words and phrases. This increases the chances that the user’s query can be matched to terms in the content index. During a query, the word breaker’s language is determined by the language of the user’s Web browser.

By default, Microsoft Office SharePoint Server 2007 installs the word breakers that are listed in the following table on each server in a SharePoint farm.

Arabic      Hungarian             Punjabi
Bengali     Icelandic             Romanian
Bulgarian   Indonesian            Russian
Catalan     Italian               Serbian_Cyrillic
Croatian    Japanese              Serbian_Latin
Czech       Kannada               Slovak
Danish      Korean                Slovenian
Dutch       Latvian               Spanish
English     Lithuanian            Swedish
Finnish     Malay                 Tamil
French      Malayalam             Telugu
German      Marathi               Thai
Greek       Norwegian_Bokmaal     Turkish
Gujarati    Polish                Ukrainian
Hebrew      Portuguese            Urdu
Hindi       Portuguese_Brazilian  Vietnamese

Stemmers

A stemmer is a component that finds the root word of a term and can also generate variations of that term. For example, in English, if a query contains the word “bought”, the stemmer can add the root term “buy” to the query and can also generate other forms of this term, such as “buys” and “buying”, and add them to the query.
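The “bought” example can be sketched as a simple query expansion. Real stemmers rely on language-specific rules and dictionaries; the tiny lookup tables ROOTS and FORMS and the function expand_query_term below are hypothetical and exist only to show the two steps, finding the root and then generating its other forms.

```python
# Hypothetical lookup tables standing in for a real English stemmer.
ROOTS = {"bought": "buy", "buys": "buy", "buying": "buy", "buy": "buy"}
FORMS = {"buy": ["buy", "buys", "buying", "bought"]}

def expand_query_term(term):
    """Return the root of a term followed by the other known forms of that root."""
    root = ROOTS.get(term.lower(), term.lower())  # step 1: find the root word
    return FORMS.get(root, [root])                # step 2: generate variations

print(expand_query_term("bought"))
```

A stemmer that finds the root word but does not generate additional forms would correspond to returning only the root from the first step.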

Stemmers are language-specific and can have different capabilities depending on the language they support. Some stemmers find the root word but do not generate additional forms of words. By default, stemming is turned off during queries for many languages. You can enable stemming for search queries in the Search Core Results Web Part.

Note

Every language that has a word breaker has a stemmer, if the language can support stemming. For some languages, stemmers are installed but not enabled. To enable these stemmers, you must edit the registry. For instructions on how to enable stemmers for these specific languages, see How to turn on word breakers and stemmers in SharePoint Server 2007 (https://go.microsoft.com/fwlink/?LinkId=141180).

Noise word files

Some words in a language are not useful when you perform searches. For example, in the English language, words such as “the” and “an” provide little search value because almost every document written in English will contain these words. Words that provide little search value are called noise words, also known as stop words. During the indexing process, noise words are removed to keep indexes smaller, which can increase performance. Noise words are contained in language-specific text files that you can edit. Removing or adding words to a noise word file requires a full crawl of the content. For more information, see Edit a noise word file (Office SharePoint Server).
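The effect of a noise word file at indexing time can be illustrated as follows. This is a sketch, not the indexer's actual logic: it assumes a file with one noise word per line, and the function names parse_noise_words and strip_noise_words are hypothetical.

```python
def parse_noise_words(file_text):
    """Read one noise word per line (assumed format), ignoring blank lines."""
    return {line.strip().lower() for line in file_text.splitlines() if line.strip()}

def strip_noise_words(terms, noise_words):
    """Drop noise words so that they never reach the index."""
    return [term for term in terms if term.lower() not in noise_words]

noise_words = parse_noise_words("the\nan\n")
print(strip_noise_words(["The", "server", "hosts", "an", "index"], noise_words))
```

The sketch also suggests why a change to the file requires a full crawl: content that was indexed earlier was filtered with the old word list.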

Noise word files have significantly changed from previous versions of SharePoint products. Many noise words that were previously included in noise word files have been removed from the Office SharePoint Server 2007 noise word files and are now included in content indexes. By default, users can perform queries on words that were previously excluded as noise words. These queries are called noise word queries. You can disallow these searches in the Search Core Results Web Part. Additionally, if a quoted string in a query includes a noise word, any word can take the place of that noise word in the matching content. For example, if a query includes “configure a server”, content items that contain “configure the server” and “configure every server” are included in the query results.
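The quoted-string behavior can be modeled by treating each noise word in the phrase as a placeholder that matches any single word. The function phrase_matches below is a hypothetical illustration of that idea using plain whitespace tokenization; it is not the query engine's implementation.

```python
def phrase_matches(phrase, text, noise_words):
    """Return True if the phrase occurs in the text, with noise words matching any word."""
    phrase_terms = phrase.lower().split()
    words = text.lower().split()
    # Slide a window the length of the phrase across the text.
    for start in range(len(words) - len(phrase_terms) + 1):
        if all(term in noise_words or term == words[start + i]
               for i, term in enumerate(phrase_terms)):
            return True
    return False

noise_words = {"a", "the"}
print(phrase_matches("configure a server", "how to configure every server", noise_words))
```

Note that the placeholder still requires a word in that position, so text that contains only “configure server” would not match in this sketch.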

Important

Do not remove all of the words in a noise word file. A noise word file must have at least one entry in it, even if the entry is only a period (.) character.

See Also

Concepts

Manage settings to improve search results (Office SharePoint Server)
Configure authoritative pages (Office SharePoint Server)
Add keyword terms with Best Bets (Office SharePoint Server)
Edit a noise word file (Office SharePoint Server)
Edit a thesaurus file (Office SharePoint Server)
Create a custom dictionary (Office SharePoint Server 2007)