Managing External Content in Microsoft Office SharePoint Portal Server 2003

Published: June 9, 2004

This is a sample chapter from the Microsoft SharePoint Products and Technologies Resource Kit. You can obtain the complete resource kit (ISBN 0-7356-1881-X), which includes a companion CD-ROM, from Microsoft Press.

One requirement for building a robust information management system is to pull together information that is located in disparate data islands. Microsoft Office SharePoint Portal Server 2003 does an outstanding job of helping you pull that information into the portal site without having to actually move it to the portal site’s database.

SharePoint Portal Server accomplishes this through its crawling and indexing features, which extract data from external sources of content (called content sources) and place the extracted data in text files that cannot be edited but can be searched, with the results displayed in a result set. Therefore, by simply crawling and indexing information, you can greatly expand the usefulness of the portal site in assisting users when they need to find information quickly and easily.

The tools that we’ll discuss in this chapter focus on both administrative and user-oriented subjects. For administrators, we’ll focus on the creation and management of content sources, search scopes, source groups, and index files. We’ll also discuss ways to craft the result set for your end users, including the use of the thesaurus, the noise word file, keywords, and Best Bets.

From an end-user’s perspective, we’ll focus on how queries are executed in the Search Web Part and outline some considerations when training your end users for this activity in the portal site. First, let’s start by differentiating between Basic Search Administration Mode and Advanced Search Administration Mode in the Search Administration pages.

On This Page

Advanced Search Administration Mode
Creating and Managing Content Sources
Working with Content Indexes
Managing and Editing Search Scopes and Source Groups
Windows SharePoint Services Search and MSSearch
The Topic Assistant
Manage Crawls of Site Directory
Manage Keywords and Best Bets
End-User Experience
Attribute Mapping for Advanced Search
Summary

Advanced Search Administration Mode

By default, when SharePoint Portal Server 2003 installs, the search administration pages run in Basic Search Administration Mode. This mode (shown in Figure 22-1) does not include an interface for working with the index files or the source groups, and it does not let you view the gatherer logs. When you enable Advanced Search Administration Mode, you gain the ability to create and manage index files, work with source groups, and view the gatherer logs.

Figure 22-1. Basic Search Administration Mode user interface

To enable Advanced Search Administration Mode, click Enable Advanced Search Administration Mode on the Configure Search And Indexing page (also shown in Figure 22-1). Remember that this is a one-way, one-time change that is user-specific and cannot be reversed.

Creating and Managing Content Sources

SharePoint Portal Server 2003 content aggregation can extend past its physical server farm and actual portal site through the use of content sources. A content source is really a set of rules that informs the gatherer service about where to connect to crawl content that is hosted on servers outside the portal. The content can be located in a multitude of places that are outlined in Table 22-1. SharePoint Portal Server 2003 can crawl the following content types in content sources:

Table 22-1. SharePoint Portal Server 2003 Content Sources

Content Source

Description

Microsoft Exchange Server public folders

SharePoint Portal Server 2003 can crawl Exchange Server folders, including messages, discussions, and collaborative content. Including Exchange Server folder content in the SharePoint Portal Server index makes Exchange-based discussions and collaborative content available to portal searches.

File share

File shares are still in wide use in many organizations. SharePoint Portal Server 2003 can use file shares as content sources. For example, the WoodGrove Bank SharePoint Portal Server 2003 implementation can use the following file shares as content sources:

\\WoodGroveBank\shareddocs\

file://WoodGroveBank/shareddocs/

Web content

SharePoint Portal Server 2003 can crawl a variety of Web content as content sources, including single static Web pages, entire websites, SharePoint Portal Server sites, and the content of Microsoft Windows SharePoint Services sites, including HTML content, documents, and lists.

Windows SharePoint Services–based sites and SharePoint Team Services–based sites

SharePoint Portal Server 2003 can index content hosted on these two site platforms. The indexing feature built into the sites themselves is a SQL-based full-text indexing engine, which is different from the MSSearch.exe indexing engine that we are discussing in this chapter.

Lotus Notes database

Content stored in Lotus Notes databases is not off limits to SharePoint Portal Server 2003 as a content source. Using a Lotus Notes database as a content source requires the Lotus Notes administrator to make some configuration changes to the Lotus Notes Index Management Server, and then to make configuration changes to the Lotus Notes protocol handler.

You can create as many content sources as you need, but once you get above a few hundred content sources, the user interface will become unworkable. However, you can still manage this many and more by using the command-line parameters. The SharePoint Portal Server object model is the recommended method of working with content sources in large server farms with hundreds or thousands of content sources.

In smaller installations, the more content sources you have, the more scheduling you’ll need to manage because you won’t want all your content sources firing at the same time to index their respective data. While it is not unusual for organizations to have hundreds of locations from which they could potentially crawl content, it is also conceivable that creating several hundred or even thousands of content sources would present significant resource needs and (perhaps) administrative difficulties.

Therefore, you’ll need to balance several competing needs as you work with SharePoint Portal Server. First, you’ll need to ensure that you crawl only data that you really need in your index. For example, if you need to crawl 100 documents that are located in a file share hosting 3,000 documents, crawling all 3,000 documents to get the data from the 100 documents into your index would be foolish. Your index would be filled with data that your users wouldn’t want appearing in their result sets. Indexing needless information only clutters your index and the result sets, leading to a less positive end-user experience when they use the portal site’s search functionality.

So, what would you do? A best practice is to move these 100 documents into their own file share and crawl that file share individually. Remember the old adage: garbage in, garbage out. If you fill your index with needless information, the result sets will not be tight and pinpointed toward what the user is really after. Keeping your indexes clean and trim will help lead to a positive end-user experience when they input a search query. Remember that the goal of creating content sources is to enable your users to find information quickly and easily. A cluttered index leads to a result set that forces your end users to hunt through the list to find the information they are after. Such hunting is neither quick nor easy, and it’s likely to “turn off” your end users from wanting to use the search features in the portal site. This is why, later in this chapter, we’ll spend time looking at ways to craft the result set for your end users.

Second, you’ll need to ensure that content sources are not created on a whim or at every request from a user. It is conceivable that employees from your sales department will decide that they’d like to have a host of sales-oriented articles on a website indexed for their own use. While SharePoint Portal Server can certainly handle this task, it might not be wise to set up content sources at the request of each user because you could end up creating many additional content sources that are really unnecessary. A best practice to guard against this is to create a SharePoint Portal Server planning team that lives in perpetuity and that approves the creation of new content sources.

Third, you’ll need to ensure that you create enough content sources to pull in the information your users need to perform their jobs and collaborate effectively, but not so many that you can’t manage them. Content source management can become a real issue in information-intensive environments. Hardware resources can be unnecessarily taxed, and without proper planning you can actually degrade the performance of your indexing server when you try to crawl a content source.

For example, one procedure that occurs every night on nearly every server in most environments is the backup procedure. There are others, such as antivirus scanning and disk defragmenting. But for purposes of our conversation here, we’ll focus on the backup procedure in our example. Because the crawling function is highly processor- and RAM-intensive on both the crawling server and the content source server, a best practice is to plan your crawl schedules around the other activities that run on those servers.

So, if you have 30 content sources that you want to crawl, a best practice is to ascertain (as best you can) the regular routines that are run on each content source server and when they are run, and then schedule the index updates during times when that server is not taxed by routines or client demand. Doing this can be a tall order, but if those servers are in different time zones, having this information might widen the crawling window for you. Therefore, you’ll need to think through what content really needs to be crawled and indexed versus how many content sources you’ll have, when the sources can be crawled, and how you’ll do this without creating bottlenecks on either server during peak crawling periods.
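To make the scheduling idea concrete, here is a minimal sketch that finds the hours when both the index server and a content source server are idle. The busy hours, the UTC offsets, and the function names are all hypothetical values chosen for illustration; nothing here reads real schedules from SharePoint Portal Server.

```python
def idle_hours_utc(busy_local_hours, utc_offset):
    """Return the UTC hours (0-23) during which a server is not busy.

    busy_local_hours: hours of the day, in the server's local time, when
    backups, scans, or client demand load the server (assumed input).
    utc_offset: the server's offset from UTC in whole hours.
    """
    busy_utc = {(hour - utc_offset) % 24 for hour in busy_local_hours}
    return set(range(24)) - busy_utc


def crawl_window(index_server, content_server):
    """Return the sorted UTC hours when both servers are idle."""
    return sorted(idle_hours_utc(*index_server) & idle_hours_utc(*content_server))


# Hypothetical servers, each described as (busy local hours, UTC offset).
index_server = (set(range(8, 18)) | {1, 2}, -6)     # business hours plus a 01:00-02:59 backup
content_server = (set(range(8, 18)) | {23, 0}, 1)   # business hours plus a 23:00-00:59 backup

window = crawl_window(index_server, content_server)
print(window)
```

With these made-up inputs the shared idle window is 00:00 through 06:59 UTC. Repeating the calculation per content source server shows which crawls can be staggered into different parts of the night.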

Default Content Sources

SharePoint Portal Server 2003 installs with several default content sources, as follows:

Portal Site Content (This Portal).

  • An incremental update of portal site content is conducted every 10 minutes in the background, and an incremental (inclusive) update is performed each night. Portal site content includes content hosted in areas plus linked content via portal listings.

People Content (People).

  • An incremental update of people content occurs every 60 minutes in the background. This content source crawls people profiles and personal sites and includes both public and private documents. However, remember that permissions are applied to result sets before they are presented to the user so that the user will see only documents for which the user has permissions.

Non–Portal Site Content (Site Directory).

  • An incremental update of non–portal site content is conducted every night to ensure that all non–portal site content is included in the index. Non–portal site content includes all site collections created in the Site Directory, but it can include other content sources as configured by the portal administrator or administrators.

The SharePoint Timer service works in conjunction with the Scheduled Tasks on the local server to run these (and other) scheduled events. You can learn about the default times by opening Control Panel and then Scheduled Tasks. In that location, you’ll see the various scheduled tasks for crawling portal site and non–portal site content. If you have set up other schedules, such as audience compilations or user profile imports, those will appear here too. You can change the timing of such scheduled events by using the Schedule tab in the properties of the scheduled event.

Creating a New Content Source

Whether you’re creating or managing a content source, you’ll do this in the same location: the Manage Content Sources page. You’ll navigate to this page by clicking Site Settings, clicking Configure Search And Indexing, and then clicking Manage Content Sources in the Other Content Sources section.

The Site Settings page governs search settings and indexed content integral to searching and indexing across a SharePoint Portal Server 2003 implementation. Table 22-2 shows the page sections for SharePoint Portal Server site settings.

Table 22-2. SharePoint Portal Server 2003 Site Settings Page Sections

Section

Description

Configure search and indexing

Governs the configuration of search and indexing on the local SharePoint Portal Server.

Use search scope from another portal site

Enables you to use the search scope from an existing SharePoint Portal Server 2003 implementation. This feature is useful if you are deploying multiple SharePoint Portal Server 2003 implementations across an organization and you want to standardize the search scope matrix in each portal site.

Manage properties from crawled documents

Enables you to manage properties from documents you crawl with SharePoint Portal Server 2003.

Manage crawls of the Site Directory

Enables you to manage the crawls of all the sites to be included in your search results.

Manage keywords

Enables the management of keywords throughout your portal site.

To create a new content source, click Add Content Source on the Manage Content Sources page (shown in Figure 22-2).

Figure 22-2. SharePoint Portal Server 2003 Manage Content Sources page

At this stage, you’ll need to specify what type of content source you want to create:

  • Exchange Server content. Using Exchange Server content as a content source enables you to tie SharePoint Portal Server into Exchange Server content such as public folders, public e-mail list folders, and other collaborative content.

  • File share. Using a file share as a content source enables you to tie SharePoint Portal Server 2003 into your existing shared network drives.

  • Web content. This type covers Web pages, websites, Windows SharePoint Services team sites, other SharePoint Portal Server 2003 portal sites, and sites residing on previous versions of SharePoint Portal Server and SharePoint Team Services.

  • Site Directory

  • Lotus Notes databases

Crawling Web Content Sources

When you create a new content source (and even after you’ve created it), you can specify how “deep” the content source should crawl the content. When it comes to websites, you can configure several different options:

  1. Click This site—follow links to all pages on this site.

  2. Click This page only.

  3. Click Custom - specify page depth and site hops. If you click the custom option, you can limit the page depth and the site hops. To do this, select the Limit page depth and Limit site hops check boxes, and then specify the limits.

The page depth is the number of links followed within sites. A site hop occurs when a link from one website leads to another website. If you specify that the number of site hops on a website content source be unlimited, SharePoint Portal Server 2003 can access an unlimited number of sites through the initial site. If you choose to reduce the page depth, three full updates must occur before any previously crawled pages are excluded. This three-update threshold is hard-coded into the product to ensure that temporarily unavailable content is not prematurely removed from the index, which would generate false-positive and unnecessary alerts.
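The difference between page depth and site hops can be illustrated with a small breadth-first walk over an in-memory link graph. This is only a sketch: the URLs are hypothetical, and the choice to restart the depth count when the walk hops to a new site is an assumption of this example, not documented product behavior.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start, links, max_depth=None, max_hops=None):
    """Walk `links` breadth-first from `start`, honoring the two limits.

    links: dict mapping a URL to the URLs it links to (stands in for the Web).
    max_depth: links that may be followed within a site; None means unlimited.
    max_hops: links that may cross to another host; None means unlimited.
    """
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, page depth, site hops)
    visited = []
    while queue:
        url, depth, hops = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue
            if urlparse(target).netloc == urlparse(url).netloc:
                next_depth, next_hops = depth + 1, hops   # same site: one level deeper
            else:
                next_depth, next_hops = 0, hops + 1       # new site (assumption: depth restarts)
            if max_depth is not None and next_depth > max_depth:
                continue
            if max_hops is not None and next_hops > max_hops:
                continue
            seen.add(target)
            queue.append((target, next_depth, next_hops))
    return visited


# Hypothetical link graph.
links = {
    "http://a.example/": ["http://a.example/p1", "http://b.example/"],
    "http://a.example/p1": ["http://a.example/p2"],
    "http://b.example/": ["http://c.example/"],
}

this_page_only = crawl("http://a.example/", links, max_depth=0, max_hops=0)
one_level_no_hops = crawl("http://a.example/", links, max_depth=1, max_hops=0)
one_hop = crawl("http://a.example/", links, max_hops=1)
```

In this sketch, setting both limits to zero corresponds to the This page only option, while leaving both at None corresponds to an unlimited crawl through the initial site.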

You can also select the Participate In Adaptive Updates check box to include this content source in adaptive updates. Note that if you select the Participate In Adaptive Updates check box, changes will show up more quickly in search results, but updates will tax more server resources. Adaptive updates are described in Chapter 21, “The Architecture of the Gatherer.” Thereafter, some of the other duties you’ll need to accomplish when setting up a new content source will be performed in these steps:

  1. Select a source group if you have enabled Advanced Search Administration Mode. In the Source Group section, perform one of the following options:

    • Type a description of the source group for this content source if you want to create a new source group for this content source in the Source group box.

    • Select one of the existing source groups if you want to use an existing source group for this content source.

  2. Perform one of the following steps to confirm or specify updates and rules for Web content as a content source:

    • Click OK.

    • Click Advanced. Specify rules to include or exclude content, specify scheduled updates, or start an update on the content_source_type content source page.

Managing Rules for Including or Excluding Content

You can create rules that include or exclude content from the content index. These rules are called site restrictions and site path rules. A site restriction is the main rule for a site. You can show or hide the other rules for a site by clicking the plus sign (+) or minus sign (-) next to the site restriction. The other rules for a site are called site path rules. The site restriction defines the overall rules for a site, and the site path rules are rules for specific parts of the site. The site path rules are nested inside the site restriction.

Site path rules are evaluated in the order they appear in the list. If there is a site rule match, all the path rules are evaluated and applied in the order they appear.

You can use site restrictions and site path rules to accomplish the following tasks:

  • Override the settings for the default content access account when crawling a specific site or path

  • Specify the granularity for crawling lists

  • Allow crawling of sites where addresses pass parameters—for example, the address includes a question mark (?)

  • Allow sites to be traversed for links without content being added to the index

  • Exclude an area from the index completely

Including or excluding content is a best practice for tuning SharePoint Portal Server search capabilities, and it’s a good tool to offer the best possible search results that meet your business requirements.

Rules can use general expressions and wildcard characters, as shown in the following examples:

  • “https://woodgrovebank/folder*” applies to all Web resources that have a URL that starts with “https://woodgrovebank/folder”

  • “https://server?web*” applies to resources such as “https://serveraweb2/file.htm” and “https://serverbweb3/file.htm”

  • “*/*.doc” applies to every Microsoft Office Word document encountered
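The wildcard examples above can be approximated with Python's standard fnmatch module, in which an asterisk matches any run of characters and a question mark matches exactly one character. This is a sketch of the pattern semantics only. The rule list, the include and exclude labels, and the first-match-wins evaluation are assumptions made for illustration, and the product's own matching engine may differ.

```python
from fnmatch import fnmatchcase


def first_matching_rule(url, rules):
    """Return the action of the first rule whose pattern matches `url`.

    rules: ordered list of (pattern, action) pairs, checked top to bottom.
    Returns None when no rule matches. Matching ignores case.
    """
    for pattern, action in rules:
        if fnmatchcase(url.lower(), pattern.lower()):
            return action
    return None


# Hypothetical rule list built from the patterns shown above.
rules = [
    ("*/*.doc", "exclude"),
    ("https://woodgrovebank/folder*", "include"),
    ("https://server?web*", "include"),
]

print(first_matching_rule("https://serveraweb2/file.htm", rules))
print(first_matching_rule("https://woodgrovebank/folder1/report.doc", rules))
```

Because the rules are checked in order, the Word document under the woodgrovebank folder is caught by the first pattern before the folder rule is reached, which is why rule order matters when patterns overlap.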

Document shortcuts are subject to the same site and path restrictions as other documents and content sources in the portal site. If a user adds a document shortcut to the portal site, SharePoint Portal Server 2003 updates that shortcut in the same way as other content sources. If site or file type restrictions prohibit the inclusion of a shortcut in the index, SharePoint Portal Server does not include content from that document shortcut in the index.

The settings described by these rules will become effective only after a new crawl occurs. If you change rules during a crawl, any content that has not been crawled yet and that is described by the rule will be affected by the changes.

You can add a rule to include specific paths in the content index, to exclude specific paths from the index, to specify how SharePoint lists are handled, or to provide a specific account to access a specific path.

Add a rule that includes or excludes content

Perform the following steps to add a rule that includes or excludes content from your SharePoint Portal Server 2003 searches:

  1. On the Configure Search And Indexing page, in the General Content Settings and Indexing Status section, click Exclude and Include Content.

  2. On the Exclude And Include Content page, click Add Rule.

  3. On the Add Rule page, in the Path box, type a path for the content affected by this rule. You can use general expressions and wildcard characters to define which resources are subject to this update rule.

  4. In the Crawl Configuration section, choose one of the following four options:

    • To exclude all documents in this URL space, click Exclude all items in this path. If you select this option, when the search component encounters a resource within this space, it will neither crawl the resource nor follow links contained within the resource.

    • To include all documents in this URL space, click Include all items in this path. If you select this option, you can also do the following:

      1. To suppress the inclusion of a page of links but still crawl the content that the page links to, select the Include linked content, but do not include source check box.

      2. To follow complex links (URLs that include question marks (?) followed by parameters), select the Follow complex links check box.

    • To allow alerts on individual SharePoint list items, click Allow alerts on individual SharePoint list items. This option will allow alerts to be generated when individual list items in Web Parts are changed, added, or deleted. By default, SharePoint lists are crawled as one item. If you do not click Allow alerts on individual SharePoint list items, alerts for the list will be sent if any item in the list is changed.

    • To crawl each SharePoint list item individually, click Index SharePoint list items individually.

  5. In the Specify Authentication section, do one of the following:

    • To use the default content access account, click Use default crawling account to log in. If the default crawling account cannot access this path, enter a user name and password combination that can crawl the content source.

    • To prevent Basic authentication from being used, select the Prohibit the server from passing plain text passwords check box. The server will attempt to use NTLM authentication.

  6. Finally, to use a client certificate for authentication, click Specify client certificate, and then select a certificate from the list.

Working with Content Indexes

A content index is a full-text index of content stored on a SharePoint Portal Server 2003 portal site. These content indexes do not include the indexes from the Microsoft SQL Server Full Text Index engine that can be used in each Windows SharePoint Services site. The content indexes are populated by the Indexing Service as it receives crawled content from the Gatherer Service.

Content indexes do not have a hard-coded limit. Successful testing has reached 5 million documents indexed to a single index file.

Creating a Content Index

The reasons for creating additional index files are a bit complex. You’ll need to bear in mind a couple of competing issues when you decide to create a new index file.

First, bear in mind that if a user’s search spans multiple index files, this will increase the load on your server when the query is executed and it will also take more time to search each index file and then compile a single result set.

Second, ranking is performed within each index file but is not performed again in the aggregate result set. So, if a search query pulls records from multiple index files, there will be no overall ranking of the result set. Therefore, the best practice here is to try, if you can, to limit the number of index files that you create. However, if you find yourself creating a plethora of index files, group the content sources together in such a way as to limit the number of times most search queries will need to traverse multiple index files.

Third, if an index file should become corrupt (which can occur if your server suddenly loses power or if there is a read/write failure in your hardware), bear in mind that if you reset the index, you’ll be starting simultaneous, full-index updates on each content source that writes to that index file. If you have a high number of content sources writing to that index file, it is possible to overload your server with the execution of multiple full-update builds to rebuild the content in that index file.

Fourth, if you’re running in a server farm environment where the indexes are created on an Index server and then propagated to a Search server, bear in mind that the larger the index file, the longer the copy operation will last between the Index server and the Search server. When the Index server is crawling a content source, that information is being written to the local copy of the index on the Index server. Only after the crawling operations are completed will the index file be copied from the Index server to the Search server. Because this is a normal file copy operation, the larger the file, the more time that will be required to copy this file from the Index server to the Search server. Depending on available bandwidth between servers, this could end up being an hour or more of copy time between servers for extremely large index files. For the portal site content incremental update that occurs every 10 minutes, this could mean that your copy operations do not finish before the next incremental crawl commences. A best practice is to perform a test copy of the largest index file to discern the amount of time required to copy a file from the Index server to the Search server and then verify that the copy time is within acceptable limits.
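Before running a real test copy, a back-of-the-envelope estimate can show whether propagation is likely to fit between incremental crawls. The sketch below is simple arithmetic under stated assumptions: the index size, link speed, and the 70 percent efficiency factor are hypothetical inputs, and a measured test copy remains the authoritative number.

```python
def copy_time_minutes(index_size_gb, link_mbps, efficiency=0.7):
    """Estimate the minutes needed to copy an index file between servers.

    index_size_gb: size of the index file in gigabytes (1 GB = 1024 MB here).
    link_mbps: nominal link speed in megabits per second.
    efficiency: assumed fraction of the nominal speed a file copy achieves.
    """
    size_megabits = index_size_gb * 1024 * 8
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 60


def fits_between_crawls(index_size_gb, link_mbps, interval_minutes=10, efficiency=0.7):
    """Return True if the estimated copy finishes before the next crawl starts."""
    return copy_time_minutes(index_size_gb, link_mbps, efficiency) < interval_minutes


# Hypothetical example: a 20 GB index over a 100 Mbps link.
estimate = copy_time_minutes(20, 100)
print(round(estimate, 1), fits_between_crawls(20, 100))
```

Under these assumptions the 20 GB copy takes roughly 39 minutes, far longer than the 10-minute incremental interval, which is the kind of result that argues for a smaller index file or a faster link between the Index server and the Search server.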

If you need to create a new index file, perform these steps:

  1. On the Configure Search And Indexing page, in the Content Indexes section, click Manage content indexes.

  2. On the Manage Content Indexes page, click New Content Index.

  3. On the Create Content Index page, in the Name and Information section, type a unique name for this index in the Name box. The name must be unique within this portal site. The description is optional and is there for your own reference.

  4. In the Source Group box, type the source group name. This name should be thought out in advance of creating the index file because source groups play a pivotal role in the creation of search scopes.

  5. In the Server list, select the server on which the index will reside.

  6. Click OK.

    Tip: Universal Naming Convention (UNC) names are not accepted here. The address must be a valid file system path for the server on which the index is being created.

Editing the Properties of a Content Index

Editing the properties of a content index takes place when you need to modify index-specific information such as description, source group, status, and logging options. Such options might have to be changed, depending on the particular requirements of your organization. The logging information available as part of content-index properties is also useful in troubleshooting any content-index issues you might encounter.

In the properties of a content index, you can learn the following information:

Name and description of the content index.

  • The name is not configurable, but the description is.

Source group.

  • You can see to which source group the index file is assigned and assign the file to a new source group if needed. To assign the file to a new source group, simply type in the name of the new source group and then click OK at the bottom of the screen.

Status information.

  • The status information includes the file size, the number of documents in the index, the last update, warnings and error message counters, and a link to the gatherer log for this index file.

Logging options.

  • These two check boxes will allow you to specify additional information that should be placed in the gatherer logs. If documents that you think should appear in the index are not appearing, enable these two selections, run a full update again, and then view the gatherer log for additional troubleshooting information.

Excluded and included paths.

  • You can view the number of included and excluded paths and then click on a link that will allow you to further modify these rules.

Because this information is held in the properties of the index files, you’ll want to enable Advanced Search Administration Mode to quickly open the properties of these index files.

Managing Content Indexes

Managing content indexes is an important element of SharePoint Portal Server 2003 site maintenance because resetting and updating content indexes keeps the index up to date and ensures a successful search experience based on the latest portal site content. The management tasks include:

  • Resetting a content index, which stops index updates and empties the selected content index entirely. This option is best used if you suspect file corruption or other issues with the selected content index. Resetting the content index file automatically starts a full update for all content sources assigned to that file.

  • Starting, stopping, or pausing index build jobs.

  • Deleting a content index. You’ll delete a content index file only when the file is no longer needed and there are no content sources assigned to that index file.

Content index management is performed from the Manage Search Settings And Indexed Content section of the Site Settings page in SharePoint Portal Server 2003.

Resetting a Content Index

Resetting a content index is the best way to empty an index file and force a full update to run for all the content sources that use that index file. You’ll want to do this if the index file becomes corrupted or if you want to quickly start a manual full update of all assigned content sources.

Perform the following steps to reset a content index:

  1. On the Manage Content Indexes page, rest the pointer on the index name and then click the arrow that appears.

  2. On the menu that appears, click Reset Content Index.

  3. On the message box that appears, click OK.

Resetting the index stops any updates that are in progress and empties the index completely. In a server farm configuration, the old index exists on the search servers until index propagation has occurred.

Deleting a Content Index

Deleting a content index is performed from the Manage Search Settings And Indexed Content section on the Site Settings page. You’ll delete a content index only when you’ve moved the content sources to another index file or when you no longer need the content in the index to appear in the search results for your users.

Note: You can delete a content index only if you have enabled Advanced Search Administration Mode. You cannot delete the Portal_Content or Non_Portal_Content indexes.

Perform the following steps to delete a content index:

  1. On the Manage Content Indexes page, point to the content index name and then click the arrow that appears.

  2. On the menu that appears, click Delete.

  3. In the confirmation message box that appears, click OK.

Deleting a content index also deletes the indexes on the search servers in a server farm configuration.

Managing and Editing Search Scopes and Source Groups

Search scopes give users the opportunity to search a portion of the overall index, thereby placing less demand on the server and returning a sharper, more focused result set. Search scopes can be defined by topics and areas, by source groups, or by both.

Source groups are a new feature of SharePoint Portal Server 2003. (They are not supported in Windows SharePoint Services.) Source groups enable an administrator to group together unique combinations of content sources and index files so that they can be used in one or more search scopes. The advantage of using source groups lies in their granularity: by creating a 1-to-1 relationship between content sources and source groups, you can build search scopes that allow users to search multiple layers of granularity in their result sets.

For example, let’s suppose that you have three Windows SharePoint Services sites in the research department: Data Modeling, Chemicals, and Team Discussions. Each site has a distinct purpose, and you need to index the information in these sites so that your users can search it. Now, let’s further suppose that your users sometimes need to search just the Chemicals documents and at other times they need to search across all three teams in the research department. Through the smart use of source groups and content sources, you can accommodate both types of searches.

What you would do is create a content source for each site: one for Data Modeling, another for Chemicals, and a third for Team Discussions. In the properties of each content source (keeping in mind that you must have Advanced Search Administration Mode enabled), you would type in a unique source group name, such as Research Data Modeling Source Group, Research Chemicals Source Group, and Research Team Discussions Source Group.

Then you would create three search scopes—each one based on a source group. Therefore, you would create a Research Data Modeling Search Scope, a Research Chemicals Search Scope, and a Research Team Discussions Search Scope. Finally, you would create a search scope that encompasses all three source groups, named (perhaps) Research Department Search Scope.

In this scenario, if a user wanted to find documents related to plastics in the Chemicals site, the user could select the Research Chemicals Search Scope and execute her query against that individual index from the Chemicals Content Source. However, if the user wanted to search across all three sites for documents related to plastics, she would select the Research Department Search Scope and her search would be executed against the portion of the overall index that was built from all three of those content sources.
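The relationships in this scenario can be sketched as plain data. The following is only an illustrative model, not a SharePoint API: the dictionary layout and the sources_in_scope helper are hypothetical, and the names are taken from the Research example above.

```python
# Hypothetical model of the Research example; nothing here calls SharePoint.

# Each content source is assigned to exactly one source group (1-to-1).
content_sources = {
    "Data Modeling site": "Research Data Modeling Source Group",
    "Chemicals site": "Research Chemicals Source Group",
    "Team Discussions site": "Research Team Discussions Source Group",
}

# Each search scope is defined by one or more source groups.
search_scopes = {
    "Research Data Modeling Search Scope": {"Research Data Modeling Source Group"},
    "Research Chemicals Search Scope": {"Research Chemicals Source Group"},
    "Research Team Discussions Search Scope": {"Research Team Discussions Source Group"},
    "Research Department Search Scope": {
        "Research Data Modeling Source Group",
        "Research Chemicals Source Group",
        "Research Team Discussions Source Group",
    },
}


def sources_in_scope(scope_name):
    """Return the content sources whose source group belongs to the scope."""
    groups = search_scopes[scope_name]
    return sorted(src for src, grp in content_sources.items() if grp in groups)
```

Calling sources_in_scope("Research Chemicals Search Scope") yields only the Chemicals site, whereas the department-wide scope yields all three sites, which is the two-level granularity the example is after.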

Implementing a robust search scope hierarchy is no easy task, and it is really more art than science. To build a hierarchy that your users will love, one suggestion is to capture the search terms they submit from the Internet Information Services (IIS) logs and use those terms to inform your scope hierarchy.
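As a rough sketch of how such a capture might work, the following script tallies search terms from W3C extended log lines. It rests on several assumptions that you should check against your own environment: that the log records the cs-uri-query field, that the search page passes the query in a parameter named k, and that the sample lines resemble your logs. The count_search_terms function is a hypothetical helper, not part of any product.

```python
from collections import Counter
from urllib.parse import parse_qs


def count_search_terms(log_lines, param="k"):
    """Tally search terms found in W3C extended log lines.

    The query-string parameter name ("k") is an assumption; adjust it
    to whatever your search page actually uses.
    """
    fields = []
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The #Fields directive names the columns of the lines that follow.
            fields = line.split()[1:]
            continue
        if not line or line.startswith("#") or not fields:
            continue
        record = dict(zip(fields, line.split()))
        query = record.get("cs-uri-query", "-")
        for term in parse_qs(query).get(param, []):
            counts[term.strip().lower()] += 1
    return counts


# Made-up sample lines for illustration only.
sample = [
    "#Fields: date time cs-method cs-uri-stem cs-uri-query sc-status",
    "2004-06-09 10:00:01 GET /Search.aspx k=plastics 200",
    "2004-06-09 10:00:05 GET /Search.aspx k=Plastics&s=All 200",
    "2004-06-09 10:00:09 GET /default.aspx - 200",
    "2004-06-09 10:00:12 GET /Search.aspx k=blue+shoes 200",
]

term_counts = count_search_terms(sample)
```

Sorting term_counts by frequency (for example, with term_counts.most_common(20)) gives a short list of the subjects users search for most often, which is a reasonable starting point for deciding which scopes to build.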

In addition, the tighter and more robust your search scopes are, the leaner and more accurate the result set will be when users execute a search query (assuming that they use the search scopes you developed). Accurate results coupled with a faster response time (because the query is not being executed against the entire index) will result in a better end-user experience and increase the positive reaction to your SharePoint Portal Server deployment.

It is important to include discussions about the best use of search scopes, source groups, index files, and content sources in your predeployment planning meetings. Even though they’re complex, such discussions can lead to a better deployment and a better end-user experience during the initial stages of a new SharePoint Portal Server 2003 deployment.

Adding a Search Scope

Perform the following steps to add a search scope:

  1. On the Manage Search Scopes page, on the toolbar, click New Search Scope.

  2. On the Add Search Scope page, type a name for this search scope.

  3. Decide whether you want to limit the scope by topics or other portal areas. In the Topics and Areas section, click Include all contents if this search scope is not limited by topic or area. To limit this search scope by topic or area, click Limit the search scope to items in the following topics or areas, and then click Change areas.

  4. On the Change Location page, select areas to use for this search scope. You can select one or more areas, but each selected area includes all of its subareas. Only items in the selected areas show up in search results when using this search scope. When you are finished selecting areas, click OK.

  5. In the Content Source Groups section, click Include all content sources if this search scope is not limited to certain groups of content sources. Click Exclude all content sources to limit the search scope to only the default content source for this portal site. To limit the search scope to particular content source groups, click Limit the search scope to the following groups of content sources, and then select the content source groups that apply.

  6. Click OK.

Using Search Scopes from Other Portal Sites

In a shared services environment, you might find that you’ll want all your users to use the same set of search scopes regardless of which portal site they are executing searches in. This will enable you to also set up a single list of search scopes and make them available across multiple portal sites.

To configure searches in this way, you’ll need to use the Use Search Scope from Another Portal Site link in the Site Settings of your portal site. Click the Associate This Portal To Another Portal option button, and then enter the URL for the associated portal site.

Windows SharePoint Services Search and MSSearch

The search engine that runs in a Windows SharePoint Services site is the SQL Server full-text search engine. The search engine that runs the SharePoint Portal Server search functionality is MSSearch.exe. These are two different, distinct engines that produce two different indexes. These indexes cannot be merged or shared, nor is there any support for attempting to merge these two engines to gain a common index for the portal site and all sites associated with the portal site.

The Topic Assistant

The Topic Assistant provides a way for you to automatically have items in the portal site categorized into areas based on the existing items in those areas. In other words, when properly trained, the Topic Assistant could take a document about flat-panel monitors stored in an area dedicated to content about monitors and include it in an area dedicated to LCD technology. This reduces the time and effort it takes to manage areas, allowing items on the portal site to appear in search results and the portal site map according to the areas to which they belong.

As you add items to lists and document libraries in a particular area, the Topic Assistant can learn (by looking at the index) from your manually added content in that area and other areas and then suggest items to list under alternate areas that it deems are appropriate for that content to appear in. The content manager of that area then can approve or reject these suggestions. As areas are added to the portal site and as items are added to areas, the Topic Assistant continues to learn and suggest items for each area.

The effectiveness of the Topic Assistant is highly dependent on the size of the training set, the appropriateness of the content in each area, and the level of precision set when configuring the feature. There must be a minimum of two areas configured to be included by the Topic Assistant with at least ten documents each to begin the training process. Every time the content index is crawled, the Topic Assistant makes its suggestions. The content index is crawled by selecting the Train Now link on the Topic Assistant page. A best practice is to well exceed the minimum requirements to train the Topic Assistant so that more accurate portal listings are created.
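The minimum training requirement can be expressed as a simple check. This is only a sketch of the rule as stated above (at least two included areas, each with at least ten documents); the can_start_training function and the area names are hypothetical, and the document counts would have to come from your own inventory of the areas.

```python
MIN_AREAS = 2            # minimum number of included areas, per the text
MIN_DOCS_PER_AREA = 10   # minimum documents in each of those areas


def can_start_training(docs_per_area):
    """Return True when enough areas hold enough documents to begin training."""
    qualifying = [area for area, count in docs_per_area.items()
                  if count >= MIN_DOCS_PER_AREA]
    return len(qualifying) >= MIN_AREAS


# Hypothetical inventory of included areas and their document counts.
inventory = {"Monitors": 14, "LCD Technology": 11, "Keyboards": 3}
ready = can_start_training(inventory)
```

Meeting the minimum only allows training to begin; as noted, exceeding it by a wide margin is what tends to produce more accurate suggestions.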

Manage Crawls of Site Directory

Whenever you create a new site collection in the Site Directory, that site collection’s URL is placed in the Manage Crawls Of Site Directory list. By default, the listing is approved and crawling of that URL is enabled.

There might be times when you’ll want to temporarily stop indexing a site collection—perhaps when the site’s administrator contacts you to ask for a temporary hold on indexing while she uploads additional information that she wants included in the next indexing process.

The default behavior is to Crawl This Site. However, you can also select to either Require Approval For Crawling or Do Not Crawl This Site from the list that is associated with the portal site listing. When you select the Require Approval For Crawling option, what you’re really doing is moving those sites out of the Approved Sites list and into the Requested Sites list. Doing so does not delete the listing itself, but rather places it in a holding pattern that requires an administrator to manually approve the listing before the content source can be crawled.

Note that site collections are not automatically crawled unless they are created through the Site Directory in the portal. For a site collection created elsewhere, you’ll need to create a content source for the site collection and then ensure that the collection’s portal site listing is enabled in this area before you’ll enjoy a successful crawl of the site collection.

If you select the Do Not Crawl This Site option, you’re moving the listing to the Rejected Sites list. The effect of moving a listing to either the Rejected Sites list or the Requested Sites list is the same: a crawl will not occur on any site in either list. Of course, the listing can be manually re-enabled by selecting the Crawl This Site option. To permanently remove a listing from being crawled, click Delete.
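The three options and their effects can be summarized in a small lookup table. This is merely a restatement of the behavior described above in code form; the CRAWL_OPTIONS structure and the is_crawled helper are hypothetical and do not correspond to any SharePoint object.

```python
# Option name -> (list the listing is moved to, whether the site is crawled)
CRAWL_OPTIONS = {
    "Crawl This Site": ("Approved Sites", True),
    "Require Approval For Crawling": ("Requested Sites", False),
    "Do Not Crawl This Site": ("Rejected Sites", False),
}


def is_crawled(option):
    """Return True if a listing set to this option will be crawled."""
    _target_list, crawled = CRAWL_OPTIONS[option]
    return crawled
```

The table makes the key point easy to see: only listings in the Approved Sites list are crawled, and the two other lists differ in intent rather than in effect.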

Site collections built through the Site Directory are not included in the crawling of the default portal content source. However, they are included in the Site Directory content source, and their information is placed in the nonportal content index file.

Manage Keywords and Best Bets

Keywords are an excellent way to capture a summary of a document in a single word or phrase. SharePoint Portal Server 2003 supports the nesting of keyword lists and the association of any URL-accessed– or Universal Naming Convention (UNC)–based resource. In addition, you can create Best Bets, which is a method of specifying that a particular document or resource will be displayed in the Best Bets Web Part above the regular result set for configured search queries.

For example, let’s assume you have the keyword human resources in your keyword list and that you have associated the Human Resource Policy Manual with this keyword as a Best Bet. When a user enters the phrase human resources in the Search Web Part, the Human Resource Policy Manual appears in the Best Bets Web Part above the result set. Best Bets thus let you manually configure certain documents or resources to appear ahead of any other resource when the user enters a keyword. This is an excellent way to craft the result set for the end user by specifying obvious documents and resources for a given search query.

End-User Experience

MSSearch uses free-text queries in the Search Web Part in SharePoint Portal Server 2003. With free-text queries, you can enter a group of words or a complete sentence. The indexing service finds pages that best match the words and phrases in the free-text query. It does this by finding pages that match the meaning, rather than the exact wording, of the query. The indexing service ignores Boolean, proximity, and wildcard operators.

You can use free-text queries to search both contents and property values. If you submit only the query text without specifying the type of query or the property, the indexing service uses the free-text query and the Contents property by default.

For example, if you enter “blue shoes” (without the quotation marks) in the Search Web Part, you will get back documents that contain the word “blue”, the word “shoes”, and both “blue” and “shoes”. In other words, search will find documents that contain either or both words.

To search for an exact phrase, enclose the phrase in quotation marks, such as “blue shoes”. When such a phrase is entered, search will return only documents that contain the exact phrase “blue shoes”.
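The difference between the two query forms can be illustrated with a toy matcher. This is not how MSSearch works internally (it ignores ranking and the meaning-based matching described earlier); it only models the rule that an unquoted query matches documents containing any of its words, whereas a quoted query requires the exact phrase. The tokenize and matches functions are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(query, document):
    """Toy rule: a quoted query needs the exact phrase; otherwise any word will do."""
    doc_words = tokenize(document)
    query = query.strip()
    if len(query) >= 2 and query[0] == '"' and query[-1] == '"':
        phrase = tokenize(query)
        n = len(phrase)
        return n > 0 and any(
            doc_words[i:i + n] == phrase
            for i in range(len(doc_words) - n + 1)
        )
    return any(word in doc_words for word in tokenize(query))
```

With this rule, the query blue shoes matches a document about blue jeans (because it contains the word blue), but the quoted form of the same query does not.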

Crafting the Result Set Using the Thesaurus

Another way to expand the set of documents that is received from a catalog is to use thesaurus files. Thesaurus files allow the user to type a phrase in a search query and receive results that are altered by the administrator. The thesaurus also enables the server farm administrator to affect search ranking by assigning weights to words. Table 22-3 lists the thesaurus files available in SharePoint Portal Server 2003.

Table 22-3. SharePoint Portal Server 2003 Thesaurus Files

Language               Thesaurus File
---------------------  --------------
Chinese-Simplified     tschs.xml
Chinese-Traditional    tscht.xml
Czech                  tscsv.xml
Dutch                  tsnld.xml
English-International  tseng.xml
English-US             tsenu.xml
Finnish                tsfin.xml
French                 tsfra.xml
German                 tsdeu.xml
Hungarian              tshun.xml
Italian                tsita.xml
Japanese               tsjpn.xml
Korean                 tskor.xml
Neutral                tsneu.xml
Polish                 tsplk.xml
Portuguese (Brazil)    tsptb.xml
Russian                tsrus.xml
Spanish                tsesn.xml
Swedish                tssve.xml
Thai                   tstha.xml
Turkish                tstrk.xml

The neutral thesaurus file is always applied to queries, in addition to the thesaurus file associated with the query language. If a query is in a language that does not have its own thesaurus file, only the neutral thesaurus file is applied.

Editing Thesaurus Files

Thesaurus files are XML files that can be edited in a text editor. When editing thesaurus files, use only well-formed XML (that is, matching opening and closing tags around each entry) or the file will not load properly. If the XML is malformed, SharePoint Portal Server logs an error in the Microsoft Windows Server 2003 event log referencing the file and line.
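Before copying an edited thesaurus file back to the server, you might check that it is well formed with a short script such as the following sketch. It uses Python’s standard XML parser and says nothing about whether the file satisfies the thesaurus schema. The element names in the two sample strings are illustrative only; compare them with the entries in your own thesaurus file.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if the text parses, else the (line, column) of the first error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return err.position
    return None


# Illustrative samples; the element names are assumptions, not the real schema.
good = "<thesaurus><expansion><sub>boots</sub><sub>shoes</sub></expansion></thesaurus>"
bad = "<thesaurus><expansion><sub>boots</sub><sub>shoes</expansion></thesaurus>"

good_result = check_well_formed(good)
bad_result = check_well_formed(bad)
```

Here good_result is None, whereas bad_result holds the position at which the parser gave up because the second sub element was never closed. To check a real file, read its contents and pass the text to check_well_formed.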

By default, SharePoint Portal Server stores thesaurus files in the Program Files\SharePoint Portal Server\DATA\Config directory of the server. The location of the DATA directory can be changed during the installation of SharePoint Portal Server. This directory contains one additional file called Tsschema.xml. Do not modify this file.

There are four ways to craft the result set for your end users via the thesaurus. The first way is to enter expansion sets, which essentially means that if a user searches on the word “boots”, we’re going to automatically expand that query to include the words “shoes” and “sandals” (for example). Expansion-set terms are equal, so if you place the three words “boots”, “shoes”, and “sandals” in the same expansion set, a search on any one of these three terms will expand the query to include the other two.

Common instances of when you’ll want to use expansion sets include expanding acronyms to include their spelled-out form, finding commonly misspelled forms of the searched words, and finding common synonyms for names and terms that might exist in your industry. For example, if you have a difficult-to-spell term, such as “pyrotechnic”, in your index, you might want to list common misspellings of this word in an expansion set so that even if users misspell the word, they will still receive a result set of the documents that they are looking for.
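The symmetric behavior of expansion sets can be modeled in a few lines. This sketch only illustrates the rule that every member of a set expands to all the others; it is not the search engine’s implementation, and the expand_query_term function and the misspellings shown are hypothetical.

```python
def expand_query_term(term, expansion_sets):
    """Return the term plus every other member of any expansion set containing it."""
    term = term.lower()
    expanded = {term}
    for group in expansion_sets:
        if term in group:
            expanded |= group
    return sorted(expanded)


# Hypothetical expansion sets based on the examples in the text.
expansion_sets = [
    {"boots", "shoes", "sandals"},
    {"pyrotechnic", "pirotechnic", "pyrotecnic"},
]
```

Searching on any one of boots, shoes, or sandals yields all three terms, and a misspelled pyrotecnic still reaches documents indexed under the correct spelling. A term that belongs to no set is returned unchanged.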

A second method of using the thesaurus is to create the opposite of an expansion set, which is called a replacement set. In this scenario, you specify a term or set of terms that you don’t want users searching on, and instead, you replace it with terms that you do want users searching on. For example, let’s suppose that you don’t want users searching on the term “pyrotechnic”, but you do want them searching on the term “explosive”. In this scenario, you’ll create a replacement set that replaces the term “pyrotechnic” with the term “explosive”.

Common instances of when you’d want to do this include replacing culturally insensitive words or words that would violate your human-resources policies. We know of one administrator who listed all the offensive words in a replacement set and replaced them with the title of the human resources (HR) policy manual. In this instance, when a user searched on an offensive term, the result set came back with a link to the HR manual. That’s a pretty good use of the thesaurus!

Believe it or not, another common use for replacement sets is for misspelled words. You can handle commonly misspelled words either through expansion or replacement sets, whichever method seems to best fit your needs and environment.
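In contrast to expansion sets, replacement works in one direction only, which the following toy model tries to make explicit. As before, this is a sketch of the rule rather than the engine’s implementation; the apply_replacements function and the data layout are hypothetical.

```python
def apply_replacements(term, replacement_sets):
    """Return the substitute terms for a matched pattern, or the term itself."""
    term = term.lower()
    for patterns, substitutes in replacement_sets:
        if term in patterns:
            return list(substitutes)
    return [term]


# Each entry pairs the terms to be replaced with the terms to search on instead.
replacement_sets = [
    ({"pyrotechnic", "pyrotecnic"}, ["explosive"]),
]
```

A search on pyrotechnic (or the misspelled pyrotecnic) is executed as a search on explosive, and the original term is dropped. A search on explosive is left alone, because the replacement is not symmetric.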

The third method of crafting the result set using the thesaurus is to use the weighting schemes. Weighting (or ranking) determines which documents will appear ahead of other documents in the result set. Terms are assigned a weight between 0.1 and 1.0, and the differences between weights are absolute, not proportional. Therefore, 0.8 is not twice as “heavy” as 0.4; relative to 0.4, a weight of 0.5 counts as higher in the same way that 0.8 does. So, select the terms that you want to weight, place them in the same expansion set, and then add weighting as illustrated in the default thesaurus file.

Finally, you can employ stemming, which means placing a term in an expansion set and then following that term with two stars, as in “run**”. What this will do is force a search in the index on all terms that begin with the three letters “r-u-n”. A best practice here is to use this method with caution, because you can get unintended effects in the result set from doing this. We suggest that you use a dictionary to start with the root word you want to stem and then follow that word down the dictionary list of words to see all the possible words that will be included in the search to ensure that the result set will be what the user really expects.

In our example, the result set would include documents that contain the word “run”, but also the words “runaway”, “runabout”, “running”, and “runt”. Such a dissimilar set of words might not be desirable as part of an overall search query, so be sure to think this through before using it.
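The dictionary walk suggested above can be approximated with a short script that previews which words a starred term would pull in. This sketch simply follows the behavior described in this section (matching every word that begins with the root); the preview_prefix_matches function and the word list are hypothetical, and in practice you would feed it a real dictionary or a list of terms drawn from your own content.

```python
def preview_prefix_matches(root, word_list):
    """List the words that begin with the root, ignoring any trailing stars."""
    root = root.lower().rstrip("*")
    return sorted({w.lower() for w in word_list if w.lower().startswith(root)})


# Hypothetical word list standing in for a dictionary.
words = ["run", "runaway", "runabout", "running", "runt", "rung", "ruin", "rerun"]

preview = preview_prefix_matches("run**", words)
```

For this list, the preview contains run, runabout, runaway, rung, running, and runt, but not ruin or rerun. Reviewing such a list before adding the starred term to an expansion set is a quick way to spot unrelated words such as runt.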

As you might have gathered by now, the use of the thesaurus is really more art than science, and more skill than technology. Because the default thesaurus is really a blank slate, you’ll need to use it over time to figure out how it can best serve your needs. A best practice here is to use a software product to capture the search queries entered into the Search Web Part, and then use that list to help you build an intelligent and useful thesaurus for your environment.

Using Multiple Thesaurus Files

During installation of the thesaurus files, a copy of the thesaurus files is saved to the <local_drive>\Program Files\SharePoint Portal Server\Data\Applications\Application UID\Config directory. These files can be used to specify thesaurus files that apply at the application level instead of at the server or server-farm level. For example, if SharePoint Portal Server and another application that is included in the portal site are installed on the same server, each can use different thesaurus files.

SharePoint Portal Server 2003 allows you to specify which attributes should appear in the Advanced Search Web Part by allowing you to configure a custom list in the drop-down list.

To configure Attribute Mapping for Advanced Search, click Manage Properties Of Crawled Documents and then drill down into the property you want to enable for advanced searching. On the Edit Property page, in the Search Options section, select the Include This Property In Advanced Search Options check box.

Note that when you select this check box, the Include This Property In The Content Index and Allow Property To Be Displayed check boxes are automatically selected and then dimmed so that you can’t clear them. In a sense, this saves us from ourselves because there would be no reason to select a property to appear in Advanced Search and then not index the property or allow the property to be displayed.

Optionally, you can also have the property appear in item details in the result set by selecting the Display This Property In Item Details In Search Results check box. Doing this will ensure that this specific property is displayed in the search results.

Finally, you can tie alerts to modifications in the property. What this means is that even if the content of the document does not change, but only the values in this property do, alerts can be generated.

Summary

This chapter covered how you can manage external content via SharePoint Portal Server 2003 through the use of content sources that tie into SharePoint Portal Server 2003 content indexes. This chapter also covered managing and editing search scopes and discussed their significance in SharePoint Portal Server 2003 searches.