Book Excerpt - Chapter 16 Enterprise search and indexing architecture and administration
Updated: May 29, 2008
Applies To: Office SharePoint Server 2007
This article is an excerpt from Microsoft Office SharePoint Server 2007 Administrator’s Companion by Bill English and the Microsoft SharePoint Community Experts and property of Microsoft Press (ISBN-13: 978-0-7356-2282-1) copyright 2007, all rights reserved. No part of this chapter may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, electrostatic, mechanical, photocopying, recording, or otherwise—without the prior written permission of the publisher, except in the case of brief quotations embodied in critical articles or reviews.
One of the main reasons that you’ll consider purchasing Microsoft Office SharePoint Server 2007 is for the robust search and indexing features that are built in to it. These features allow you to crawl any type of content and provide improved relevance in search results. You’ll find these features to be some of the most compelling parts of this server suite.
For both SharePoint Server 2007 and Windows SharePoint Services 3.0, Microsoft is using a common search engine, Microsoft Search (mssearch.exe). This is welcome news for those of us who worked extensively in the previous versions of SharePoint. Microsoft Windows SharePoint Services 2.0 used the Microsoft SQL Server full-text engine, and SharePoint Portal Server 2003 used the MSSearch.exe (actually named SharePointPSSearch) engine. The problems this created, such as incompatibility between indexes or having to physically move to the portal to execute a query against the portal’s index, have been resolved in this version of SharePoint Products and Technologies.
In this chapter, the discussion of the search and indexing architecture is interwoven with administrative and best practices discussions. Because this is a deep, wide, and complex feature, you’ll need to take your time to digest and understand both the strengths and challenges that this version of Microsoft Search introduces.
Understanding the Microsoft Vision for Search
The vision for Microsoft Search is straightforward and can be summarized in these bullet points:
Great results every time.
There isn’t much sense in building a search engine that will give substandard result sets. Think about it. When you enter a query term in other Internet-based search engines, you’ll often receive a result set that gives you 100,000 or more links to resources that match your search term. Often, only the first 10 to 15 results hold any value at all, rendering the vast majority of the result set useless. Microsoft’s aim is to give you a lean, relevant result set every time you enter a query.
Search integrated across familiar applications.
Microsoft is integrating new or improved features into well-known interfaces. Improved search functionality is no exception. As the SharePoint Server product line matures in the coming years, you’ll see the ability to execute queries against the index worked into many well-known interfaces.
Ability to index content regardless of where it is located.
One difficulty with SharePoint Portal Server 2003 was its inability to crawl content held in different types of databases and structures. With the introduction of the Business Data Catalog (BDC), you can expose data from any data source and then crawl it for your index. The crawling, exposing, and finding of data from nontraditional data sources (that is, sources other than file servers, SharePoint sites, Web sites, and Microsoft Exchange public folders) will depend directly on your BDC implementation. Without the BDC, the ability to crawl and index information from any source is diminished.
A scalable, manageable, extensible, and secure search and indexing product.
Microsoft has invested a large amount of capital into making its search engine scalable, more easily managed, extensible, and more secure. In this chapter, you’ll learn about how this has taken place.
As you can see, these are aggressive goals. But they are goals that, for the most part, have been attained in SharePoint Server 2007. In addition, you’ll find that the strategies Microsoft has used to meet these goals are innovative and smart.
Crawling Different Types of Content
One challenge of using a common search engine across multiple platforms is that the type of data and access methods to that data change drastically from one platform to another. Let’s look at four common scenarios.
Rightly or wrongly (depending on how you look at it), people tend to keep their information on their desktops. And the desktop is only one of several locations where information can be saved. Frustration often arises when people looking for their documents are unable to find them because they can’t remember where they saved them. A strong desktop search engine that indexes content on local hard drives is now essential in most environments.
Information that is crawled and indexed across an intranet site, or a series of Web sites that comprise your intranet, is exposed via links. Finding information in a site involves finding information in a linked environment and understanding when multiple links point to a common content item. When multiple links point to the same item, that tends to indicate that the item is more important in terms of relevance in the result set. In addition, crawling linked content that might link back to itself through circuitous routes demands a crawler that knows how deep and wide to crawl before it stops following available links to the same content. Within a SharePoint site, this is more easily defined: we just tell the crawler to crawl within a certain URL namespace and, often, that is all we need to do.
In many environments, Line of Business (LOB) information that is held in dissimilar databases representing dissimilar data types is often displayed via customized Web sites. In the past, crawling this information has been very difficult, if not impossible. But with the introduction of the Business Data Catalog (BDC), you can now crawl and index information from any data source. The BDC will be important if you want to include LOB data in your index.
When searching for information in your organization’s enterprise beyond your intranet, you’re really looking for documents, Web pages, people, e-mail, postings, and bits of data sitting in disparate, dissimilar databases. To crawl and index all this information, you’ll need to use a combination of the BDC and other, more traditional types of content sources, such as Web sites, SharePoint sites, file shares, and Exchange public folders. Content sources is the term we use to refer to the servers or locations that host the content that we want to crawl.
Moving forward in your SharePoint deployment, you’ll want to strongly consider using the mail-enabling features for lists and libraries. The ability to include e-mail into your collaboration topology is compelling because so many of our collaboration transactions take place in e-mail, not in documents or Web sites. If e-mails can be warehoused in lists within sites that the e-mails reference, this can only enhance the collaboration experience for your users.
Nearly all the data on the Internet is linked content. Because of this, crawling Web sites requires additional administrative effort to set boundaries around the crawler process via crawl rules and crawler configurations. The crawler can be tightly configured to crawl individual pages or loosely configured to crawl entire sites, even following links across DNS name changes.
You’ll find that there might be times when you’ll want to “carve out” a portion of a Web site for crawling without crawling the entire Web site. In this scenario, you’ll find that the crawl rules might be frustrating and might not achieve what you really want to achieve. Later in this chapter, we’ll discuss how the crawl rules work and what their intended function is. But it suffices to say here that although the search engine itself is very capable of crawling linked content, throttling and customizing the limitations of what the search engine crawls can be tricky.
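The first-match-wins behavior behind “carving out” part of a site can be sketched in a few lines. This is a conceptual illustration only, not SharePoint’s implementation: the rule list, the `should_crawl` helper, and the wildcard semantics (borrowed from Python’s `fnmatch`) are all invented for the example.

```python
from fnmatch import fnmatch

# Hypothetical crawl rules, evaluated in order; the first matching
# rule decides whether a URL is crawled (a first-match-wins model).
CRAWL_RULES = [
    ("http://intranet/hr/archive/*", "exclude"),
    ("http://intranet/hr/*",         "include"),
    ("http://intranet/*",            "exclude"),
]

def should_crawl(url):
    """Return True if the first matching rule includes the URL."""
    for pattern, action in CRAWL_RULES:
        if fnmatch(url, pattern):
            return action == "include"
    return False  # no rule matched: do not crawl

print(should_crawl("http://intranet/hr/policies.aspx"))     # True
print(should_crawl("http://intranet/hr/archive/old.aspx"))  # False
```

The ordering matters: placing the narrow `archive` exclusion before the broad `hr` inclusion is what carves the archive out of an otherwise crawled branch.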
Architecture and Components of the Microsoft Search Engine
Search in SharePoint Server 2007 is a shared service that is available only through a Shared Services Provider (SSP). In a Windows SharePoint Services 3.0-only implementation, the basic search engine is installed, but it will lack many components that you’ll most likely want to install into your environment.
Table 16-1 provides a feature comparison between the search engine that is installed with a Windows SharePoint Services 3.0-only implementation and a SharePoint Server 2007 implementation.
|Feature|Windows SharePoint Services 3.0|SharePoint Server 2007|
|---|---|---|
|Content that can be indexed|Local SharePoint content|SharePoint content, Web content, Exchange public folders, file shares, Lotus Notes, Line of Business (LOB) application data via the BDC|
|Create Really Simple Syndication (RSS) feeds from result set|No|Yes|
|The “Did You Mean...?” prompt|No|Yes|
|Scopes based on managed properties|No|Yes|
|Customizable tabs in Search Center|No|Yes|
|People Search/Knowledge Network|No|Yes|
|Crawl information via the BDC|No|Yes|
|Application programming interfaces (APIs) provided|Query|Query and Administration|
The architecture of the search engine includes the following elements:
The term content source can sometimes be confusing because it is used in two different ways in the literature. The first way it is used is to describe the set of rules that you assign to the crawler to tell it where to go, what kind of content to extract, and how to behave when it is crawling the content. The second way this term is used is to describe the target source that is hosting the content you want to crawl. By default, the following types of content sources can be crawled (and if you need to include other types of content, you can create a custom content source and protocol handler):
SharePoint content, including content created with present and earlier versions
Web sites
File shares
Exchange public folders
Any content exposed by the BDC
IBM Lotus Notes (must be configured before it can be used)
The crawler extracts data from a content source. Before crawling the content source, the crawler loads the content source’s configuration information, including any site path rules, crawler configurations, and crawler impact rules. (Site path rules, crawler configurations, and crawler impact rules are discussed in more depth later in this chapter.) After it is loaded, the crawler connects to the content source using the appropriate protocol handler and uses the appropriate iFilter (defined later in this list) to extract the data from the content source.
The protocol handler tells the crawler which protocol to use to connect to the content source. The protocol handler that is loaded is based on the URL prefix, such as HTTP, HTTPS, or FILE.
The iFilter (Index Filter) tells the crawler what kind of content it will be connecting to so that the crawler can extract the information correctly from the document. The iFilter that is loaded is based on the URL’s suffix, such as .aspx, .asp, or .doc.
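The two-step dispatch described above, protocol handler by URL prefix and iFilter by suffix, can be modeled as two lookup tables. The registries and component names below are invented for illustration; the real protocol handlers and iFilters are registered components on the index server.

```python
from urllib.parse import urlparse
import os

# Illustrative registries only; names are hypothetical.
PROTOCOL_HANDLERS = {"http": "HttpHandler", "https": "HttpsHandler", "file": "FileHandler"}
IFILTERS = {".doc": "WordFilter", ".aspx": "HtmlFilter", ".asp": "HtmlFilter"}

def pick_components(url):
    """Choose a protocol handler by URL prefix and an iFilter by suffix."""
    parts = urlparse(url)
    scheme = parts.scheme.lower()                      # e.g. "http"
    suffix = os.path.splitext(parts.path)[1].lower()   # e.g. ".doc"
    return PROTOCOL_HANDLERS.get(scheme), IFILTERS.get(suffix)

print(pick_components("http://intranet/docs/report.doc"))
# ('HttpHandler', 'WordFilter')
```

A `None` in either slot mirrors the real-world failure mode: with no matching protocol handler or iFilter, the item cannot be crawled.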
The indexer stores the words that have been extracted from the documents in the full-text index. In addition, each word in the content index has a relationship to its metadata in the property store (the Shared Services Provider’s Search database in SQL Server) so that the metadata for that word in a particular document can be enforced in the result set. For example, NTFS permissions determine whether a document containing the queried word appears in the result set: all result sets are security-trimmed before they are presented, so users see only links to documents and sites to which they already have permissions.
The property store is the Shared Services Provider’s (SSP) Search database in SQL Server that hosts the metadata on the documents that are crawled. The metadata includes NTFS and other permission structures, author name, date modified, and any other default or customized metadata that can be found and extracted from the document, along with data that is used to calculate relevance in the result set, such as frequency of occurrence, location information, and other relevance-oriented metrics that we’ll discuss later in this chapter in the section titled “Relevance Improvements.” Each row in the SQL table corresponds to a separate document in the full-text index. The actual text of the document is stored in the content index so that it can be used for content queries. For a Web site, each unique URL is considered to be a separate “document.”
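The division of labor between the two stores can be shown with a toy model: the full-text index maps a word to the documents containing it, while the property store holds per-document metadata, including an access control list used for security trimming. All document names, users, and fields below are invented.

```python
# Toy full-text index: word -> set of document ids.
full_text_index = {
    "budget": {"doc1", "doc2"},
    "forecast": {"doc2"},
}
# Toy property store: document id -> metadata, including an ACL.
property_store = {
    "doc1": {"author": "Judy Lew", "acl": {"judy", "bill"}},
    "doc2": {"author": "Bill English", "acl": {"bill"}},
}

def query(word, user):
    """Join the index with the property store, then security-trim."""
    hits = full_text_index.get(word, set())
    return sorted(d for d in hits if user in property_store[d]["acl"])

print(query("budget", "judy"))  # ['doc1']  - doc2 is trimmed away
print(query("budget", "bill"))  # ['doc1', 'doc2']
```

Losing either dictionary breaks the query, which is the intuition behind the requirement to back up both the file-system index and the SSP’s Search database together.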
We want to stress that you need both the index on the file system (which is held on the Index servers and copied to the Query servers) and the SSP’s Search database in order to successfully query the index.
The relationship between words in the index and metadata in the property store is a tight relationship that must exist in order for the result set to be rendered properly, if at all. If either the property store or the index on the file system is corrupted or missing, users will not be able to query the index and obtain a result set. This is why it is imperative to ensure that your index backups successfully back up both the index on the file system and the SSP’s Search database. Using the SharePoint Server 2007’s backup tool will back up the entire index at the same time and give you the ability to restore the index as well (several third-party tools will do this too).
But if you back up only the index on the file system without backing up the SQL database, you will not be able to restore the index. The same is true if you back up only the SQL Server database and not the index on the file system. Do not let your SQL administrators or infrastructure administrators sway you on this point: to obtain a trustworthy backup of your index, you must use either a third-party tool written for precisely this job or the backup tool that ships with SharePoint Server 2007. If you use two different tools to back up the SQL property store and the index on the file system, it is highly likely that when you restore both parts, the index will contain inconsistencies because the two parts were backed up at different times, and your results will vary accordingly.
When the crawler starts to crawl a content source, several things happen in succession very quickly. First, the crawler looks at the URL it was given and loads the appropriate protocol handler, based on the prefix of the URL, and the appropriate iFilter, based on the suffix of the document at the end of the URL.
The content source definitions are held in the Shared Services Provider Search SQL Server database and the registry. When initiating a crawl, the definitions are read from the registry because this gives better performance than reading them from the database. Definitions in the registry are synchronized with the database so that the backup/restore procedures can back up and restore the content source definitions. Never modify the content source definitions in the registry. This is not a supported action and should never be attempted.
Then the crawler checks to ensure that any crawler impact rules, crawl rules, and crawl settings are loaded and enforced. Then the crawler connects to the content source and creates two data streams out of the content source. First the metadata is read, copied, and passed to the Indexer plug-in. The second stream is the content, and this stream is also passed to the Indexer plug-in for further work.
All the crawler does is what we tell it to do using the crawl settings in the content source, the crawl rules (formerly known as site path rules in SharePoint Portal Server 2003), and crawler impact rules (formerly known as site hit frequency rules in SharePoint Portal Server 2003). The crawler will also not crawl documents that are not listed in the file types list, nor will it be able to crawl a file if it cannot load an appropriate iFilter. Once the content is extracted, it is passed off to the Indexer plug-in for processing.
When the Indexer receives the two data streams, it places the metadata into the SSP’s Search database, which, as you’ll recall, is also called the property store. In terms of workflow, the metadata is first passed to the Archival plug-in, which reads the metadata and adds any new fields to the crawled properties list. Then the metadata is passed to the SSP’s Search database, or property store. What’s nice here is that the archival plug-in (formerly known as the Schema plug-in in SharePoint Portal Server 2003) automatically detects and adds new metadata types to the crawled properties list (formerly known as the Schema in SharePoint Portal Server 2003). It is the archival plug-in that makes your life as a SharePoint Administrator easier: you don’t have to manually add the metadata type to the crawled properties list before that metadata type can be crawled.
For example, let’s say a user entered a custom text metadata field in a Microsoft Office Word document named “AAA” with a value of “BBB.” When the Archival plug-in sees this metadata field, it will notice that the document doesn’t have a metadata field called “AAA” and will therefore create one as a text field. It then writes that document’s information into the property store. The Archival plug-in ensures that you don’t have to know in advance all the metadata that could potentially be encountered in order to make that metadata useful as part of your search and indexing services.
After the metadata is written to the property store, the Indexer still has a lot of work to do. The Indexer performs a number of functions, many of which have been essentially the same since Index Server 1.1 in Internet Information Services 4.0. The indexer takes the data stream and performs both word breaking and stemming. First, it breaks the data stream into 64-KB chunks (not configurable) and then performs word breaking on the chunks. For example, the indexer must decide whether the data stream that contains “nowhere” means “no where” or “now here.” The stemming component is used to generate inflected forms of a given word. For example, if the crawled word is “buy,” then inflected forms of the word are generated, such as “buys,” “buying,” and “bought.” After word breaking has been performed and inflection generation is finished, the noise words are removed to ensure that only words that have discriminatory value in a query are available for use.
Results of the crawler and indexing processes can be viewed using the log files that the crawler produces. We’ll discuss how to view and use this log later in this chapter.
Understanding and Configuring Relevance Settings
Generally speaking, relevance relates to how closely the search results returned to the user match what the user wanted to find. Ideally, the results on the first page are the most relevant, so users do not have to look through several pages of results to find the best result for their search.
The product team for SharePoint Server 2007 has added a number of new features that substantially improve relevance in the result set. The following sections detail each of these improvements.
Click Distance
Click distance refers to how far each content item in the result set is from an “authoritative” site. In this context, “sites” can be either Web sites or file shares. By default, all the root sites in each Web application are considered first-level authoritative.
You can determine which sites are designated to be authoritative by simply entering the sites or file shares your users most often visit to find information or to find their way to the information they are after. Hence, the logic is that the “closer” in number of clicks a site is to an authoritative site, the more relevant that site is considered to be in the result set. Stated another way, the more clicks it takes to get from an authoritative site to the content item, the less relevant that item is thought to be and the lower it will appear in the result set.
You will want to evaluate your sites over time to ensure that you’ve appropriately ranked the sites your users visit. When content items from more than one site appear in the result set, it is highly likely that some sites’ content will be more relevant to the user than other sites’ content. SharePoint Server 2007 allows you to set primary (first-level), secondary (second-level), and tertiary (third-level) sites, as well as sites that should never be considered authoritative. Use this three-tiered approach to explicitly set primary, secondary, and tertiary levels of importance for individual sites in your organization. Determining which sites should be placed at which level is probably more art than science and will be a learning process over time.
To set authoritative sites, you’ll need to first open the SSP in which you need to work, click the Search Settings link, and then scroll to the bottom of the page and click the Relevance Settings link. This will bring you to the Edit Relevance Settings page, as illustrated in Figure 16-1.
Note that on this page, you can input any URL or file share into any one of the three levels of importance. By default, all root URLs for each Web application that are associated with this SSP will be automatically listed as most authoritative. Secondary and tertiary sites can also be listed. Pages that are closer (in terms of number of clicks away from the URL you enter in each box) to second-level or third-level sites rather than to the first-level sites will be demoted in the result set accordingly. Pages that are closer to the URLs listed in the Sites To Demote pane will be ranked lower than all other results in the result set.
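The click-distance idea above amounts to a breadth-first search over the link graph, starting from the authoritative pages. The graph below is invented; a real implementation would walk the crawled link structure.

```python
from collections import deque

# A toy link graph; edges point from a page to the pages it links to.
links = {
    "portal": ["hr", "it"],
    "hr": ["hr/policies"],
    "it": ["it/kb"],
    "hr/policies": [],
    "it/kb": [],
}

def click_distances(authoritative):
    """BFS: number of clicks from the nearest authoritative page."""
    dist = {page: 0 for page in authoritative}
    queue = deque(authoritative)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist

print(click_distances(["portal"]))
# {'portal': 0, 'hr': 1, 'it': 1, 'hr/policies': 2, 'it/kb': 2}
```

A ranking layer can then demote items in proportion to this distance, and apply a stronger demotion for pages closest to second-level, third-level, or demoted sites.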
Hyperlink Anchor Text
When you hover your mouse over a link, the descriptive text that appears is called anchor text. The hyperlink anchor text feature ties the query term or phrase to that descriptive text. If there is a match between the anchor text and the query term, that URL is pushed up in the result set and made to be more relevant. Anchor text only influences rank and is not the determining factor for including a content item in the result set. Search indexes the anchor text from the following elements:
HTML anchor elements
Windows SharePoint Services link lists
Office SharePoint Portal Server listings
Office Word 2007, Office Excel 2007, and Office PowerPoint 2007 hyperlinks
URL Surf Depth
Important or relevant content is often located closer to the top of a site’s hierarchy, instead of in a location several levels deep in the site. As a result, the content has a shorter URL, so it’s more easily remembered and accessed by the user. Search makes use of this fact by looking at URL depth, or how many levels deep within a site the content item is located. Search determines this level by looking at the number of slash (/) characters in the URL; the greater the number of slash characters in the URL path, the deeper the URL is for that content item. As a consequence, a large URL depth number lowers the relevance of that content item.
If a query term matches a portion of the URL for a content item, that content item is considered to be of higher relevance than if the query term had not matched a portion of the content item’s URL. For example, if the query term is “muddy boots” and the URL for a document is http://site1/library/muddyboots/report.doc, because “muddy boots” (with or without the space) is part of the URL with an exact match, the report.doc will be raised in its relevance for this particular query.
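Both signals in this section, slash-counted URL depth and query-term-in-URL matching, are simple enough to sketch directly. The helper names are invented; only the counting and matching logic follows the description above.

```python
from urllib.parse import urlparse

def url_depth(url):
    """Count path levels; a larger depth lowers relevance."""
    path = urlparse(url).path.strip("/")
    return path.count("/") + 1 if path else 0

def url_match_boost(query, url):
    """True when the query (with or without spaces) appears in the URL."""
    return query.replace(" ", "").lower() in url.lower()

url = "http://site1/library/muddyboots/report.doc"
print(url_depth(url))                       # 3
print(url_match_boost("muddy boots", url))  # True
```

With these two signals, `report.doc` is moderately deep (three levels) but gets a relevance boost for the “muddy boots” query because the compacted query string matches a URL segment exactly.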
Automatic Metadata Extraction
Microsoft has built a number of classifiers that look for particular kinds of information in particular places within Microsoft documents. When that type of information is found in those locations and there is a query term match, the document is raised in relevance in the result set. A good example of this is the title slide in PowerPoint. Usually, the first slide in a PowerPoint deck is the title slide that includes the author’s name. If “Judy Lew” is the query term and “Judy Lew” is the name on the title slide of a PowerPoint deck, that deck is considered more relevant to the user who is executing the query and will appear higher in the result set.
Automatic Language Detection
Documents that are written in the same language as the query are considered to be more relevant than documents written in other languages. Search determines the user’s language based on Accept-Language headers from the browser in use. When calculating relevance, content that is retrieved in that language is considered more relevant. Because there is so much English language content and a large percentage of users speak English, English is also ranked higher in search relevance.
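An Accept-Language header carries weighted language tags, and picking the user’s preferred language means taking the tag with the highest quality value. The minimal parser below is a sketch of that idea, not SharePoint’s code; a production parser would follow the HTTP specification’s full grammar.

```python
def preferred_language(accept_language):
    """Parse an Accept-Language header and return the best-weighted tag."""
    choices = []
    for part in accept_language.split(","):
        piece = part.strip().split(";q=")
        tag = piece[0].strip()
        q = float(piece[1]) if len(piece) == 2 else 1.0  # default weight 1.0
        choices.append((q, tag))
    return max(choices)[1] if choices else None

print(preferred_language("fr-FR,fr;q=0.8,en-US;q=0.6"))  # fr-FR
```

A relevance layer would then treat documents tagged `fr-FR` (or `fr`) as more relevant for this user, all else being equal.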
File Type Relevance Biasing
Certain document types are considered to be inherently more important than other document types. Because of this, Microsoft has hard-coded which documents will appear ahead of other documents based on their type, assuming all other factors are equal. File type relevance biasing does not supersede or override other relevance factors. Microsoft has not released the file type ordering that it uses when building the result set.
Search administration is now conducted entirely within the SSP. The portal (now known as the Corporate Intranet Site) is no longer tied directly to the search and indexing administration. This section discusses the administrative tasks that you’ll need to undertake to effectively administrate search and indexing in your environment. Specifically, it discusses how to create and manage content sources, configure the crawler, set up site path rules, and throttle the crawler through the crawler impact rules. This section also discusses index management and provides some best practices along the way.
Creating and Managing Content Sources
The index can hold only that information that you have configured Search to crawl. We crawl information by creating content sources. The creation and configuration of a content source and associated crawl rules involves creating the rules that govern where the crawler goes to get content, when the crawler gets the content, and how the crawler behaves during the crawl.
To create a content source, we must first navigate to the Configure Search Settings page. To do this, open your SSP administrative interface and click the Search Settings link under the Search section. Clicking on this link will bring you to the Configure Search Settings page (shown in Figure 16-2).
Notice that you are given several bits of information right away on this page, including the following:
Number of items in the index
Number of errors in the crawler log
Number of content sources
Number of crawl rules defined
Which account is being used as the default content access account
The number of managed properties that group one or more crawled properties
Whether search alerts are active or deactivated
Current propagation status
This list can be considered a search administrator’s dashboard to instantly give you the basic information you need to manage search across your enterprise. Once you have familiarized yourself with your current search implementation, click the Content Sources link to begin creating a new content source. When you click this link, you’ll be taken to the Manage Content Sources page (shown in Figure 16-3). On this page, you’ll see a listing of all the content sources, the status of each content source, and when the next full and incremental crawls are scheduled.
Note that a default content source is created in each SSP: Local Office SharePoint Server Sites. By default, this content source is not scheduled to run or crawl any content; you’ll need to configure the crawl schedules manually. This source includes all content that is stored in the sites within the server or server farm. If you plan on having multiple SSPs in your farm, ensure that only one of these default content sources is scheduled to run. If more than one is configured to crawl the farm, you’ll unnecessarily crawl your farm’s local content multiple times, unless users in different SSPs all need the farm content in their indexes, which would then raise the question of why you have multiple SSPs in the first place.
If you open the properties of the Local Office SharePoint Server Sites content source, you’ll note also that there are actually two start addresses associated with this content source and they have two different URL prefixes: HTTP and SPS3. By default, the HTTP prefix will point to the SSP’s URL. The SPS3 prefix is hard-coded to inform Search to crawl the user profiles that have been imported into that SSP’s user profile database.
To create a new content source, click the New Content Source button. This will bring you to the Add Content Source dialog page (Figure 16-4). On this page, you’ll need to give the content source a name. Note that this name must be unique within the SSP, and it should be intuitive and descriptive—especially if you plan to have many content sources.
If you plan to have many content sources, it would be wise to develop a naming convention that maps to the focus of the content source so that you can recognize the content source by its name.
Notice also, as shown in the figure, that you’ll need to select which type of content source you want to create. Your selections are as follows:
SharePoint Sites
This content source is meant to crawl SharePoint sites and simplifies the user interface so that some choices are already made for you.
Web Sites
This content source type is intended to be used when crawling Web sites.
File Shares
This content source will use traditional Server Message Block and Remote Procedure Calls to connect to a shared folder.
Exchange Public Folders
This content source is optimized to crawl content in an Exchange public folder.
Business Data Catalog
Select this content source if you want to crawl content that is exposed via the Business Data Catalog.
You can have multiple start addresses for your content source. This improvement over SharePoint Portal Server 2003 is welcome news for those who needed to crawl hundreds of sites and were forced into managing hundreds of content sources. Note that while you can enter different types of start addresses into the start address input box for a given content source, it is not recommended that you do this. Best practice is to enter start addresses that are consistent with the content source type configured for the content source.
Assume you have three file servers that host a total of 800,000 documents. Now assume that you need to crawl 500,000 of those documents, and those 500,000 documents are exposed via a total of 15 shares. In the past, you would have had to create 15 content sources, one for each share. But today, you can create one content source with 15 start addresses and schedule one crawl and create one set of site path rules for one content source. Pretty nifty!
Planning your content sources is now easier because you can group similar content targets into a single content source. Your only real limitation is the timing of the crawl and the length of time required to complete the crawl. For example, performing a full crawl of blogs.msdn.com will take more than two full days. So grouping other blog sites with this site might be unwise.
The balance of the Add Content Source page (shown in Figure 16-5) involves specifying the crawl settings and the crawl schedules and deciding whether you want to start a full crawl manually.
The crawl settings instruct the crawler how to behave relative to depth and breadth given the different content source types. Table 16-2 lists each of these types and associated options.
|Type of crawl||Crawler setting options||Notes|
|SharePoint sites||Crawl everything under the hostname for each start address. Crawl only the SharePoint site of each start address.||Crawling everything under the hostname will crawl all site collections at that start address, not just the root site in the site collection. In this context, hostname means URL namespace. This option includes new site collections inside a managed path.|
|Web sites||Only crawl within the server of each start address. Only crawl the first page of each start address. Custom—specify page depth and server hops.||In this context, server means URL namespace (for example, contoso.msft). Crawling only the first page means that only a single page will be crawled. "Page depth" refers to page levels in a Web site hierarchy; "server hops" refers to changing the URL namespace—that is, changes in the Fully Qualified Domain Name (FQDN) that occur before the first "/" in the URL.|
|File shares||The folder and all subfolders of each start address. The folder of each start address only.|| |
|Exchange public folders||The folder and all subfolders of each start address. The folder of each start address only.||What is evident here is that you'll need a different start address for each public folder tree.|
|Business Data Catalog||Crawl entire Business Data Catalog. Crawl selected applications.|| |
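The "server hops" concept can be pictured as the crawler comparing the FQDN portion of each discovered link against the FQDN of the start address. The following Python sketch illustrates the idea only; the function name and logic are ours, not SharePoint's implementation:

```python
from urllib.parse import urlparse

def is_server_hop(start_address: str, link: str) -> bool:
    """A link is a 'server hop' when its FQDN (the part before the
    first '/' after the scheme) differs from the start address's FQDN."""
    return urlparse(start_address).netloc != urlparse(link).netloc

# Staying within www.contoso.msft is not a hop:
print(is_server_hop("http://www.contoso.msft", "http://www.contoso.msft/docs/a.aspx"))  # False
# Moving to a different URL namespace is a hop:
print(is_server_hop("http://www.contoso.msft", "http://blogs.msdn.com/post1"))          # True
```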
The crawl schedules allow you to schedule both full and incremental crawls. Full index builds will treat the content source as new. Essentially, the slate is wiped clean and you start over crawling every URL and content item and treating that content source as if it has never been crawled before. Incremental index builds update new or modified content and remove deleted content from the index. In most cases, you’ll use an incremental index build.
You’ll want to perform full index builds in the following scenarios because only a full index build will update the index to reflect the changes in these scenarios:
Any changes to crawl inclusion/exclusion rules.
Any changes to the default crawl account.
Any upgrade to a Windows SharePoint Services site, because the upgrade action deletes the change log. With no change log to reference, an incremental crawl cannot run, so a full crawl must be initiated.
Changes to .aspx pages.
When you add or remove an iFilter.
When you add or remove a file type.
Changes to property mappings take effect on a document-by-document basis as each affected document is crawled, whether the crawl is incremental or full. A full crawl of all content sources will ensure that document property mapping changes are applied consistently throughout the index.
Now, there are a couple of planning issues that you need to be aware of. The first has to do with full index builds, and the second has to do with crawl schedules. First, you need to know that subsequent full index builds that are run after the first full index build of a content source will start the crawl process and add to the index all the content items it finds. Only after the build process is complete will the original set of content items in the index be deleted. This is important to note because the index can be anywhere from 10 percent to 40 percent of the size of the content (also referred to as the corpus) you’re crawling, and for a brief period of time, you’ll need twice the amount of disk space that you would normally need to host the index for that content source.
For example, assume you are crawling a file server with 500,000 documents, and the total amount of disk space for these documents is 1 terabyte. Then assume that the index is roughly equal to 10 percent of the size of these documents, or 100 GB. Further assume that you completed a full index build on this file server 30 days ago, and now you want to do another full index build. When you start to run that full index build, several things will be true:
The new index will be created for that file server during the crawl process.
The current index of that file server will remain available to users for queries while the new index is being built.
The current index will not be deleted until the new index has successfully been built.
At the moment in time when the new index has successfully finished and the deletion of the old index for that file server has not started, you will be using 200 percent of disk space to hold that index.
The old index will be deleted item by item. Depending on the size and number of content items, that could take from several minutes to many hours.
Each deletion of a content item will result in a warning message for that content source in the Crawl Log. Even if you delete the content source, the Crawl Log will still display the warning messages for each content item for that content source. In fact, deleting the content source will result in all the content items in the index being deleted, and the Crawl Log will reflect this too.
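The disk-space arithmetic from the file server example can be checked quickly. The 10 percent ratio is the chapter's working assumption; real-world ratios run from roughly 10 to 40 percent of the corpus:

```python
corpus_gb = 1000     # 1 terabyte of documents on the file server
index_ratio = 0.10   # assume the index is about 10% of the corpus size

index_gb = corpus_gb * index_ratio  # steady-state index size for this content source
peak_gb = index_gb * 2              # old and new indexes coexist briefly during a full rebuild

print(index_gb, peak_gb)  # 100.0 200.0
```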
The scheduling of when indexes should be run is a planning issue. "How often should I crawl my content sources?" The answer to this question is always the same: the frequency of content changes combined with the level of urgency for the updates to appear in your index will dictate how often you crawl the content. Some content—such as old reference documents that rarely, if ever, change—might be crawled once a year. Other documents, such as daily or hourly memo updates, can be crawled daily, hourly, or every 10 minutes.
Administrating Crawl Rules
Formerly known as site path rules, crawl rules let you apply additional instructions to the crawler when it crawls certain sites.
For the default content source in each SSP—the Local Office SharePoint Server Sites content source—Search provides two default crawl rules that are hard coded and can’t be changed. These rules are applied to every http://ServerName added to the default content source and do the following:
Exclude all .aspx pages within http://ServerName
Include all the content displayed in Web Parts within http://ServerName
For all other content sources, you can create crawl rules that give additional instructions to the crawler on how to crawl a particular content source. You need to understand that rule order is important, because the first rule that matches a particular set of content is the one that is applied. The exception to this is a global exclusion rule, which is applied regardless of the order in which the rule is listed. The next section runs through several common scenarios for applying crawl rules.
Do not use rules as another way of defining content sources or providing scope. Instead, use rules to specify more details about how to handle a particular set of content from a content source.
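Because the first matching rule wins, rule order matters. The following Python sketch mimics that first-match behavior with wildcard paths; it is illustrative only and not how SharePoint stores or evaluates its rules:

```python
from fnmatch import fnmatch

# Ordered crawl rules: (path pattern, behavior). The first match wins.
rules = [
    ("http://www.contoso.msft/archive/*", "exclude"),
    ("http://www.contoso.msft/*",         "include"),
]

def effective_behavior(url: str, default: str = "include") -> str:
    for pattern, behavior in rules:
        if fnmatch(url, pattern):
            return behavior  # first matching rule is applied; later rules are ignored
    return default

print(effective_behavior("http://www.contoso.msft/archive/old.doc"))  # exclude
print(effective_behavior("http://www.contoso.msft/hr/policy.doc"))    # include
```

Reversing the two rules in the list would cause the broader include rule to match everything first, which is why ordering deserves attention when you mix includes and excludes.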
Specifying a Particular Account to Use When Crawling a Content Source
The most common reason people implement a crawl rule is to specify an account that has at least Read permissions (which are the minimum permissions needed to crawl a content source) on the content source so that the information can be crawled. When you select the Specify Crawling Account option (shown in Figure 16-6), you’re enabling the text boxes you use to specify an individual crawling account and password. In addition, you can specify whether to allow Basic Authentication. Obviously, none of this means anything unless you have the correct path in the Path text box.
Crawling Complex URLs
Another common scenario that requires a crawl rule is when you want to crawl URLs that contain a question mark (?). By default, the crawler stops at the question mark and does not crawl any content represented by the portion of the URL that follows it. For example, say you want to crawl the URL http://www.contoso.msft/default.aspx?top=courseware. In the absence of a crawl rule, the portion of the Web site represented by "top=courseware" would not be crawled, and you would not be able to index the information from that part of the Web site. To crawl the courseware page, you need to configure a crawl rule.
So how would you do this, given our example here? First, you enter a path. Referring back to Figure 16-6, you’ll see that all the examples given on the page have the wildcard character “*” included in the URL. Crawl rules cannot work with a path that doesn’t contain the “*” wildcard character. So, for example, http://www.contoso.msft would be an invalid path. To make this path valid, you add the wildcard character, like this: http://www.contoso.msft/*.
Now you can set up site path rules that are global and apply to all your content sources. For example, if you want to ensure that all complex URLs are crawled across all content sources, enter a path of http://*/* and select the Include All Items In This Path option plus the Crawl Complex URLs check box. That is sufficient to ensure that all complex URLs are crawled across all content sources for the SSP.
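What the crawler ignores by default is the query string: everything after the "?". Python's standard URL parser shows the split; this illustrates the anatomy of a complex URL, not the crawler itself:

```python
from urllib.parse import urlparse

url = "http://www.contoso.msft/default.aspx?top=courseware"
parts = urlparse(url)

print(parts.path)   # /default.aspx  -> crawled by default
print(parts.query)  # top=courseware -> skipped unless a crawl rule enables complex URLs
```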
Crawler Impact Rules
Crawler Impact Rules are the old Site Hit Frequency Rules that were managed in Central Administration in the previous version; although the name has changed to Crawler Impact Rules, they are still managed in Central Administration in this version. Crawler Impact Rules are a farm-level setting, so whatever you decide to configure at this level will apply to all content sources across all SSPs in your farm. To access the Crawler Impact Rules page from the Application Management page, perform these steps.
Click Manage Search Service.
Click Crawler Impact Rules.
To add a new rule, click the Add Rule button in the navigation bar. The Add Crawler Impact Rule page will appear (shown in Figure 16-7).
You’ll configure the page based on the following information. First, the Site text box is not limited to the name of a specific Web site; you can also enter global URLs, such as http://* or http://*.com or http://*.contoso.msft. In other words, although you can enter a crawler impact rule for a specific Web site, sometimes you’ll enter a global URL. Notice that you then set a Request Frequency rule. There are really only two options here: how many documents to request in a single request and how long to wait between requests. The default behavior of the crawler is to ask for eight documents per request and wait zero seconds between requests. Generally, you input a number of seconds between requests to conserve bandwidth; even one second will have a noticeable impact on how fast the crawler crawls the content sources affected by the rule. And generally, you’ll input a lower number of documents to process per request if you need to ensure better performance on the target server that is hosting the information you want to crawl.
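The two dials a crawler impact rule turns can be sketched as a simple request loop: a batch size (documents per request) and a wait between requests. This is a hypothetical illustration of the throttling idea, not SharePoint code:

```python
import time

def crawl(urls, docs_per_request=8, wait_seconds=0):
    """Request documents in batches, pausing between requests,
    mirroring the two settings of a crawler impact rule."""
    requests = 0
    for i in range(0, len(urls), docs_per_request):
        batch = urls[i:i + docs_per_request]
        # ... fetch this batch of documents from the target server here ...
        requests += 1
        time.sleep(wait_seconds)  # larger values reduce load on the target server
    return requests

# 20 documents at the default 8 per request -> 3 requests
print(crawl(["doc%d" % n for n in range(20)]))  # 3
```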
SSP-Level Configurations for Search
When you create a new SSP, you’ll have several configurations that relate to how search and indexing will work in your environment. This section discusses those configurations.
First, you’ll find these configurations on the Edit Shared Services Provider configuration page (not illustrated), which can be found by clicking the Create Or Configure This Farm’s Shared Services link in the Application Management tab in Central Administration.
Click the down arrow next to the SSP you want to focus on, and click Edit from the context list.
Scroll to the bottom of the page (as shown in Figure 16-8), and you’ll see that you can select which Index Server will be the crawler for all the content sources created within this Web application. You can also specify the path on the Index Server where you want the indexes to be held. As long as the server sees this path as a local drive, you’ll be able to use it. Remote drives and storage area network (SAN) connections should work fine as long as they are mapped and set up correctly.
Managing Index Files
If you’re coming from a SharePoint Portal Server 2003 background, you’ll be happy to learn that you have only one index file for each SSP in SharePoint Server 2007. As a result, you don’t need to worry anymore about any of the index management tasks you had in the previous version.
Having said that, there are some index file management operations that you’ll want to pay attention to. This section outlines those tasks.
The first big improvement in SharePoint Server 2007 is the Continuous Propagation feature. Essentially, instead of copying the entire index from the Index server to the Search server (using SharePoint Portal Server 2003 terminology here) every time a change is made to that index, now you’ll find that as information is written to the Content Store on the Index server (using SharePoint Server 2007 terminology now), it is continuously propagated to the Query server.
Continuous Propagation Continuous propagation is the act of ensuring that all the indexes on the Query servers are kept up to date by copying the indexes from the Index servers. As the indexes are updated by the crawler, those updates are quickly and efficiently copied to the Query servers. Remember that users query the index sitting on the Query server, not the Index server, so the faster you can update the indexes on the Query server, the faster you’ll be able to give updated information to users in their result set. Continuous propagation has the following characteristics:
Propagation uses the NetBIOS name of query servers to connect. Therefore, it is not a best practice to place a firewall between your Query server and Index server in SharePoint Server 2007 due to the number of ports you would need to open on the firewall.
Resetting Index Files
Resetting the index file is an action you’ll want to take only when necessary. When you reset the index file, you completely clean out all the content and metadata in both the property and content stores. To repopulate the index file, you need to re-crawl all the content sources in the SSP. These crawls will be full index builds, so they will be both time consuming and resource intensive. The reason you would want to reset the index is that you suspect your index has somehow become corrupted, perhaps due to a power outage or power supply failure, and needs to be rebuilt.
Troubleshooting Crawls Using the Crawl Logs
If you need to see why the crawler isn’t crawling certain documents or certain sites, you can use the crawl logs to see what is happening. The crawl logs can be viewed on a per-content-source basis. They can be found by clicking on the down arrow for the content source in the Manage Content Sources page and selecting View Crawl Log to open the Crawl Log page (as shown in Figure 16-9). You can also open the Crawl Log page by clicking the Log Viewer link in the Quick Launch bar of the SSP team site.
After this page is opened, you can filter the log in the following ways:
By content source
By status type
By last status message
The status message of each document appears below the URL along with a symbol indicating whether or not the crawl was successful. You can also see, in the right-hand column, the date and time that the message was generated.
There are three possible status types:
Success. The crawler was able to successfully connect to the content source, read the content item, and pass the content to the Indexer.
Warning. The crawler was able to connect to the content source and tried to crawl the content item, but it was unable to for one reason or another. For example, if your site path rules are excluding a certain type of content, you might receive the following error message (note that the warning message uses the old terminology for crawl rules):
The specified address was excluded from the index. The site path rules may have to be modified to include this address.
Error. The crawler was unable to communicate with the content source. Error messages might say something like this:
The crawler could not communicate with the server. Check that the server is available and that the firewall access is configured correctly.
Another very helpful element in the Crawl Log page (refer back to Figure 16-9 if needed) is the Last Status Message drop-down list. The list that you’ll see is filtered by which status types you have in focus. If you want to see all the messages that the crawler has produced, be sure to select All in the Status Type drop-down list. However, if you want to see only the Warning messages that the crawler has produced, select Warning in the Status Type drop-down list. Once you see the message you want to filter on, select it and you’ll see the results of all the crawls within the date range you’ve specified appear in the results list. This should aid troubleshooting substantially. This feature is very cool.
If you want to get a high-level overview of the successes, warnings, and error messages that have been produced across all your content sources, the Log Summary view of the Crawl Log page is for you. To view the log summary view, click on the Crawl Logs link from the Configure Search Settings page. The summary view should appear. If it does not, click the Log Summary link in the left pane and it will appear (as shown in Figure 16-10).
Each of the numbers on the page represents a link to the filtered view of the log. So if you click on one of the numbers in the page, you’ll find that the log will have already filtered the view based on the status type without regard to date or time.
Working with File Types
The file type inclusions list specifies the file types that the crawler should include or exclude from the index. Essentially, the way this works is that if the file type isn’t listed on this screen, search won’t be able to crawl it. Most of the file types that you’ll need are already listed along with an icon that will appear in the interface whenever that document type appears.
To manage file types, click the File Type Inclusions link on the Configure Search Settings page. This will bring you to the Manage File Types screen, as illustrated in Figure 16-11.
To add a new file type, click the New File Type button and enter the extension of the file type you want to add. All you need to enter are the file type’s extension letters, such as “pdf” or “cad.” Then click OK. Note that even though the three-letter extensions on the Manage File Types page represent a link, when you click the link, you won’t be taken anywhere.
Adding the file type here really doesn’t buy you anything unless you also install the iFilter that matches the new file type and the icon you want used with this file type. All you’re doing on this screen is instructing the crawler that if there is an iFilter for these types of files and if there is an associated icon for these types of files, then go ahead and crawl these file types and load the file’s icon into the interface when displaying this particular type of file.
Vendors of third-party iFilters will usually supply a .dll to install into the SharePoint platform, and they will usually include an installation routine. You’ll need to ensure you’ve installed their iFilter into SharePoint in order to crawl those document types. If they don’t supply an installation program for their iFilter, you can try running the following command from the command line:
regsvr32.exe <path\name of iFilter .dll>
This should load their iFilter .dll file so that Search can crawl those types of documents. If this command line doesn’t work, contact the iFilter’s manufacturer for information on how to install their iFilter into SharePoint.
To load the file type’s icon, upload the icon (preferably a small .gif file) to the drive:\program files\common files\Microsoft shared\Web server extensions\12\template\images directory. After uploading the file, write down the name of the file, because you’ll need to modify the docicon.xml file to include the icon as follows:
<Mapping Key="<doc extension>" Value="NameofIconFile.gif"/>
After this, restart your server and the icon should appear. In addition, you should be able to crawl and index those file types that you’ve added to your SharePoint deployment. Even if the iFilter is loaded and enabled, if you delete the file type from the Manage File Types screen, search will not crawl that file type. Also, if you have multiple SSPs, you’ll need to add the desired file types into each SSP’s configurations, but you’ll only need to load the .dll and the icon one time on the server.
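For example, registering an icon for the pdf file type might look like the following fragment. The icon file name here is an assumption; substitute whatever name you uploaded to the images directory:

```xml
<!-- In docicon.xml, under drive:\program files\common files\
     Microsoft shared\Web server extensions\12\template\xml -->
<Mapping Key="pdf" Value="pdficon.gif"/>
```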
Creating and Managing Search Scopes
A search scope provides a way to logically group items in the index together based on a common element. This helps users target their query to only a portion of the overall index and gives them a leaner, more relevant result set. After you create a search scope, you define the content to include in that search scope by adding scope rules, specifying whether to include or exclude content that matches each rule. You can define scope rules based on a Web address, a property query, a content source, or all content.
You can create and define search scopes at the SSP level or at the individual site-collection level. SSP-level search scopes are called shared scopes, and they are available to all the sites configured to use a particular SSP. Search scopes can be built from the following items:
Any specific URL
A file system folder
Exchange public folders
A specific host name
A specific domain name
Managed properties are built by grouping one or more crawled properties. Hence, there are really two types of properties that form the Search schema: crawled properties and managed properties. The crawled properties are properties that are discovered and created “on the fly” by the Archiving plug-in. When this plug-in sees new metadata that it has not seen before, it grabs that metadata field and adds the crawled property to the list of crawled properties in the search schema. Managed properties are properties that you, the administrator, create.
The behavior choices are to include any item that matches the rule, require that every item in the scope match this rule, or exclude items matching this rule.
Items are matched to their scope via the scope plug-in during the indexing and crawl process. Until the content items are passed through the plug-in from a crawl process, they won’t be matched to the scope that you’ve created.
Creating and Defining Scopes
To create a new search scope, you’ll need to navigate to the Configure Search Settings page, and then scroll down and click the View Scopes link. This opens the View Scopes page, at which point you can click New Scope. On the Create Scope page (shown in Figure 16-12), you’ll need to enter a title for the scope (required) and a description of the scope (optional). The person creating the scope will be the default contact for the scope, but a different user account can be entered if needed. You can also configure a customized results page for users who use this scope, or you can leave the scope at the default option to use the default search results page. Configure this page as needed, and then click OK. This procedure only creates the scope. You’ll still need to define the rules that will designate which content is associated with this scope.
After the scope is created on the View Scopes page, the update status for the scope will be “Empty – Add Rules.” The “Add Rules” will be a link to the Add Scope Rule page (shown in Figure 16-13), where you can select the rule type and the behavior of that rule.
Each rule type has its own set of variables. This section discusses the rule types in the order in which they appear in the interface.
Web Address Scope Type
First, the Web Address scope type will allow you to set scopes for file system folders, any specific URL, a specific host name, a specific domain name, and Exchange public folders.
Here are a couple of examples. Start with a folder named “Docs” on a server named “Search1.” Suppose that you have the following three folders inside Docs: “red”, “green” and “blue.” You want to scope just the docs in the green folder. The syntax you would enter, as illustrated in Figure 16-14, would be file://search1/docs/green. You would use the Universal Naming Convention (UNC) sequence in a URL format with the “file” prefix. Even though this isn’t mentioned in the user interface (UI), it will work.
Another example is if you want all content sources related to a particular domain, such as contoso.msft, to be scoped together. All you have to do is click to select the Domain Or Subdomain option and type contoso.msft.
Property Query Scope Type
The Property Query scope type allows you to add a managed property and make it equal to a particular value. The example in the interface is Author=John Doe. However, the default list is rather short and likely won’t have the managed property you want to scope in the interface. If this is the case for you, you’ll need to configure that property to appear in this interface. To do this, navigate to the Managed Properties View by clicking the Document Managed Properties link on the Configure Search Settings page. This opens the Document Property Mappings page (shown in Figure 16-15).
On this page, there are several columns, including the property name, the property type, whether or not it can be deleted, the actual mapping of the property (more on this in a moment), and whether or not it can be used in scopes. All you need to do to get an individual property to appear in the Add Property Restrictions drop-down list on the Add Scope Rule page is edit the properties of the managed property and then select the Allow This Property To Be Used In Scopes check box.
If you need to configure a new managed property, click New Managed Property (refer back to Figure 16-15), and you will be taken to the New Managed Property page (Figure 16-16). You can configure the property’s name, its type, its mappings, and whether it should be used in scope development.
In the Mappings To Crawled Properties section, you can look at the actual schema of your deployment and select one or more crawled properties to group together into a single managed property. For example, assume you have a Word document with a custom metadata field labeled AAA with a value of BBB. After crawling the document, if you click Add Mapping, you’ll be presented with the Crawled Property Selection dialog box. For ease of illustration, select the Office category from the Select A Category drop-down list. Notice that in the Select A Crawled Property selection box, the AAA(Text) metadata was automatically added by the Archival plug-in. (See Figure 16-17.)
Once you click OK in the Crawled Property Selection dialog box, you’ll find that this AAA property appears in the list under the Mappings To Crawled Properties section. Then select the Allow This Property To Be Used In Scopes check box (property here refers to the Managed Property, not the crawled property), and you’ll be able to create a scope based on this property.
If you want to see all the crawled properties, from the Document Property Mappings page, you can select the Crawled Properties link in the left pane. This takes you to the various groupings of crawled property types. (See Figure 16-18.) Each category name is a link. When invoked, the link gives you all the properties that are associated with that category. There is no method of creating new categories within the interface.
Content Source and All Content Scopes
The Content Source Scope rule type allows you to tether a scope directly to a content source. An example of when you would want to do this might be when you have a file server that is being crawled by a single content source and users want to be able to find files on that file server without finding other documents that might match their query. Anytime you want to isolate queries to the boundaries of a given content source, select the Content Source rule type.
The All Content rule type is a global scope that will allow you to query the entire index. You would make this selection if you’ve removed the default scopes and need to create a scope that allows a query of the entire index.
Thus far, we’ve been discussing what would be known as Global Scopes, which are scopes that are created at the SSP level. These scopes are considered to be global in nature because they are available to be used across all the site collections that are associated with the SSP via the Web applications. However, creating a scope at the SSP level does you no good unless you enable that scope to be used at the site-collection level.
Remember that scopes created at the SSP level are merely available for use in the site collections. They don’t show up by default until the Site Collection Administration has manually chosen them to be displayed and used within the site collection.
Site Collection Scopes
Scopes that are created at the SSP level can be enabled or disabled at the site-collection level. This gives site-collection administrators some authority over which scopes will appear in the search Web Parts within their site collection. The following section describes the basic actions to take to enable a global scope at the site-collection level.
First, open the Site Collection administration menu from the root site in your site collection. Then click the Search Scopes link under the Site Collection Administration menu to open the View Scopes page. On this page, you can create a new scope just for this individual site collection and/or create a new display group in which to display the scopes that you do create. The grouping gives the site collection administrator a method of organizing search scopes. The methods used to create a new search scope at the site-collection level are identical to those used at the SSP level.
To add a scope created at the SSP level, click New Display Group (as shown in Figure 16-19), and then you’ll be taken to the Create Scope Display Group page (shown in Figure 16-20).
On the Create Scope Display Group page, although you can see the scopes created at the SSP level, it appears they are unavailable because they are grayed out. However, when you select the Display check box next to the scope name, the scope is activated for that group. You can also set the order in which the scopes will appear from top to bottom.
Your configurations at the site-collection level will not appear until the SSP’s scheduled interval for updating scopes is completed. You can manually force the scopes to update by clicking the Start Update Now link on the Configure Search Settings page in the SSP.
If you want to remove custom scopes and the Search Center features at the site-collection level, you can do so by clicking the Search Settings link in the Site Collection Administration menu and then clearing the Enable Custom Scopes And Search Center Features check box. Note also that this is where you can specify a different page for this site collection’s Search Center results page.
Removing URLs from the Search Results
At times, you’ll need to remove individual content items or even an entire batch of content items from the index. Perhaps one of your content sources has crawled inappropriate content, or perhaps there are just a couple of individual content items that you simply don’t want appearing in the result set. Regardless of the reasons, if you need to remove search results quickly, you can do so by navigating to the Search Result Removal link in the SSP under the Search section.
When you enter a URL to remove the results, a crawl rule excluding the specific URLs will also be automatically created and enforced until you manually remove the rule. Deleting the content source will not remove the crawl rule.
Use this feature as a way to remove obviously spurious or unwanted results. But don’t try to use this feature to “carve out” large portions of Web sites unless you have some time to devote to finding each unwanted URL on that page.
Understanding Query Reporting
Another feature that Microsoft has included in Search is the Query Reporting tool. This tool is automatically built in to the SSP and gives you the opportunity to view all the search results across your SSP. Use this information to help you understand better what scopes might be useful to your users and what searching activities they are engaging in. There are two types of reports you can view. The first is the Search Queries Report, which includes basic bar graphs and pie charts. (See Figure 16-21.) On this page, you can read the following types of reports:
Queries over previous 30 days
Queries over previous 12 months
Top query origin site collections over previous 30 days
Queries per scope over previous 30 days
Top queries over previous 30 days
These reports can show you which group of users is most actively using the Search feature, the most common queries, and the query trends for the last 12 months. For each of these reports, you can export the data in either a Microsoft Excel or Adobe Acrobat (PDF) format.
In the left pane, you can click the Search Results link, which will invoke two more reports (as shown in Figure 16-22):
Search results top destination pages
Queries with zero results
These reports can help you understand where your users go when they execute queries. They can inform not only your scope development, but also end-user training about which queries lead users to commonly accessed locations and which queries do not.
The Client Side of Search
End users will experience Search from a very different interface than has been discussed thus far in this chapter. They will execute queries and view query results within the confines of the new Search Center.
The Search Center is a SharePoint site dedicated to helping end users search the index. It includes several components, each responsible for a specific search task. The default setup for the Search Center includes the following components:
Search Center Home
People Search Results
Search Navigation tab
You can customize the Search Center by modifying these components or by creating your own version of these components. Because they are merely Web Parts, you can add them to the Search Center to extend its functionality or remove them to focus the Search Center’s functionality.
Executing Queries to Query the Index
Search supports two types of queries: keyword and SQL. This section discusses only keyword search syntax because the SQL Server syntax is designed to be used by developers creating custom solutions.
You can use any of the following as keywords:
Phrase (two or more words enclosed in quotation marks)
Prefix (includes a part of a word, from the beginning of the word)
There are three types of search terms that you can use: Simple, Included, and Excluded. Table 16-4 describes your options.
Simple Search query word with no special requirements
Included (+) Search query word that must be in the content items returned in the result set
Excluded (-) Keyword that must not be in the content items that are returned in the result set
The inclusion "+" sign means that each content item in the result set must contain that word. For example, if you were to search on muddy boots (without quotation marks), you would see documents in the result set that contained the word muddy, documents that contained the word boots, and documents that contained both. If you want to require that every document in the result set contain the word boots, your search phrase would be muddy +boots. By the same token, if you want to exclude documents that contain a certain word, use the "-" sign. For example, if you're looking for any type of boots except muddy ones, enter the query boots -muddy.
The query terms are not case sensitive. Searching on "windows" is the same as searching on "windOWS." However, the thesaurus is case sensitive by default, so if you enter "Windows" as a word in an expansion set and the user enters "windows" as a query word, the expansion set will not be invoked. If you want to turn off case sensitivity in the thesaurus, enter the following element at the beginning of the thesaurus file: <case caseflag="false"></case>
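As a sketch of where the element sits, the top of a tsenu.xml file with thesaurus case sensitivity turned off might look like the following. The expansion entries are illustrative, and the exact placement of the <case> element relative to the <thesaurus> element is an assumption; verify against your build.

```xml
<XML ID="Microsoft Search Thesaurus">
    <case caseflag="false"></case>
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
        </expansion>
    </thesaurus>
</XML>
```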
Query terms are also not accent sensitive, so searching for "resume" will return results for "résumé." However, you can turn on diacritic (accent) sensitivity by using the stsadm.exe osearchdiacriticsensitive operation.
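The osearchdiacriticsensitive operation takes additional parameters beyond the operation name. A fuller invocation, assuming the documented -ssp and -setstatus parameters and using a placeholder SSP name, might look like this:

```
stsadm -o osearchdiacriticsensitive -ssp SharedServices1 -setstatus True
```

Here SharedServices1 stands in for the name of your SSP; substitute your own.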
After a query is executed against the index, the browser is redirected automatically to the Search Results page. The results page can potentially contain the following types of results:
High confidence results
Keyword and Best Bet results
The layout of the information on the results page is a direct result of combining the search Web Parts for the results page. Because the results are passed from the index to the search results page in XML format, many of these Web Parts use XSLT to format the results. This is why you’ll need to enter additional XSLT commands if you want to perform certain actions, such as adding another property to the Advanced Search drop-down list.
Adding Properties to Advanced Search in SharePoint Server 2007
To add a property to the advanced search Web Part, perform these steps:
Navigate to the results page and then click Edit Page from the Site Actions menu.
Open the edit menu for the search Web Part, and select Modify Shared Web Part to open the Web Part property pane.
Expand the Miscellaneous section in the properties pane, and find the property called "properties." You'll find an XML string that allows you to define what properties will be displayed in the advanced search. Best practice here is to copy the string to Notepad for editing.
Edit the XML string, and save it back into the property. You can save the XML in the format shown below. This XML is copied directly from the Web Part. You’ll need a profile property in the schema for the XML to hold any real value.
<Properties>
    <Property Name="Department" ManagedName="Department" ProfileURI="urn:schemas-microsoft-com:sharepoint:portal:profile:Department"/>
    <Property Name="JobTitle" ManagedName="JobTitle" ProfileURI="urn:schemas-microsoft-com:sharepoint:portal:profile:Title"/>
    <Property Name="Responsibility" ManagedName="Responsibility" ProfileURI="urn:schemas-microsoft-com:sharepoint:portal:profile:SPS-Responsibility"/>
    <Property Name="Skills" ManagedName="Skills" ProfileURI="urn:schemas-microsoft-com:sharepoint:portal:profile:SPS-Skills"/>
    <Property Name="QuickLinks" ManagedName="QuickLinks" ProfileURI="urn:schemas-microsoft-com:sharepoint:portal:profile:QuickLinks"/>
</Properties>
The elements that you’ll need to pay attention to are as follows:
Profile URI (Uniform Resource Identifier)
If you look at the URN (Uniform Resource Name) string carefully, you’ll see that the profile name is being pulled out of the profile URN. This is why you’ll need a profile property in the schema before this XML will have any real effect.
URI, URN, and URL: Brief Overview
URIs, URNs, and URLs play an important, yet quiet, role in SharePoint Server 2007. A Uniform Resource Identifier (URI) provides a simple and extensible means for identifying a resource uniquely on the Internet. Because each identifier is unique, no other person, company, or organization can have identical identifiers for their resources on the Internet.
The identifier can either be a “locator” (URL) or a “name” (URN). The URI syntax is organized hierarchically, with components listed in order of increasing granularity from left to right. For example, referring back to the XML data for the advanced search Web Part, we found that Microsoft had at least this URN: “urn:schemas-microsoft-com:sharepoint:portal:profile:SPS-Responsibility"
As you move from left to right, you move from more general to more specific, finally arriving at the name of the resource. No other resource on the Internet can be named exactly the same as the sps-responsibility resource in SharePoint. The hierarchical characteristic of the naming convention means that governance of the lower portions of the namespace is delegated to the registrant of the upper portion of the namespace. For example, the registered portion of the URN we’re using in our running example is “schemas-microsoft-com.” The rest of the URN is managed directly by Microsoft, not the Internet registering authority.
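Reading the example URN left to right makes this hierarchy concrete. The annotation below is illustrative; the labels are descriptive, not official component names:

```
urn : schemas-microsoft-com : sharepoint : portal : profile : SPS-Responsibility
 |            |                   |          |         |            |
scheme   registered portion   product    feature    object     resource
         (managed by          family     area       type       name
          Microsoft)
```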
You will find URIs, URLs, and URNs throughout SharePoint and other Microsoft products. Having a basic understanding of these elements will aid your administration of your SharePoint deployment.
Modifying Other Search Web Parts
When you modify any one of the search Web Parts, you’ll notice that a publish toolbar appears. This toolbar enables you to modify this page without affecting the current live page. You can then publish this page as a draft for testing before going live.
Server Name Mappings
Server name mappings are crawl settings you can configure to override how server names and URLs are displayed or accessed in the result set after content has been included in the index. For example, you can configure a content source to crawl a Web site via a file share path, and then create a server name mapping entry to map the file share to the Web site’s URL. Another way to look at this feature is that it gives you the ability to mask internal file server names with external names so that your internal naming conventions are not revealed in the result set.
The thesaurus is a way to expand, replace, or weight query terms at the time the user enters a query in the Search box. Using the thesaurus, you can implement expansion sets, replacement sets, weighting, and stemming. This section focuses on the expansion and replacement sets.
The thesaurus is held in an XML file, which is located in the drive:\program files\office sharepoint server\data\ directory and has the format of TS<XXX>.XML, where XXX is the standard three-letter code for a specific language. For English, the file name is tsenu.xml.
Here are the contents of the file in its default form:
<XML ID="Microsoft Search Thesaurus">
<!-- Commented out
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <expansion>
            <sub weight="0.8">Internet Explorer</sub>
            <sub weight="0.2">IE</sub>
            <sub weight="0.9">IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub weight="1.0">Windows 2000</sub>
        </replacement>
        <expansion>
            <sub weight="0.5">run**</sub>
            <sub weight="0.5">jog**</sub>
        </expansion>
    </thesaurus>
-->
</XML>
Note that the sample entries ship commented out; they serve as a template and have no effect until you remove the comment markers.
There are two parts to the code: an expansion set and a replacement set.
You use expansion sets to force the expansion of certain query terms to automatically include other query terms. For example, you could do this when a product name changes but the older documents are still relevant, if acronyms are commonly used as query terms, or when new terms arise that refer to other terms, such as slang or industry-specific use of individual words.
If a user enters a specified word, hits that match that word's configured synonyms will also be displayed. For instance, if a user searches on the word "car," you can configure the thesaurus to treat "sedan" as a synonym, so that the result set includes content items that contain the word "sedan," whether or not those items also mention the word "car."
For example, to use the car illustration, you create the following code:
<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <expansion>
            <sub>car</sub>
            <sub>sedan</sub>
        </expansion>
    </thesaurus>
</XML>
You can have more than two terms in the expansion set, and the use of any term in the set will invoke the expansion of the query to include all the other terms in the expansion set.
If you want multiple expansion sets created—say, one for “car” and the other for “truck”— your code would look like this:
<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <expansion>
            <sub>car</sub>
            <sub>sedan</sub>
            <sub>automobile</sub>
        </expansion>
        <expansion>
            <sub>truck</sub>
            <sub>pickup truck</sub>
            <sub>SUV</sub>
        </expansion>
    </thesaurus>
</XML>
You can see how each expansion set is its own set of synonyms. This file can be as long as you want, and expansion sets need not be topically similar.
You can use the thesaurus to create a replacement set of words by specifying an initial word or pattern of query terms that will be replaced by a substitution set of one or more words. For example, you could create a replacement set that specifies the pattern "book writer" with the substitutions "author" and "wordsmith." In this example, when a user executes a query on the phrase "book writer," the result set returns documents that contain the words "author" or "wordsmith," but not documents that contain only the phrase "book writer."
You do this to ensure that commonly misspelled words are correctly spelled in the actual query. For example, the word "chrysanthemum" is easily misspelled, so placing various misspellings into a replacement set can help your users get the result set they're looking for even if they can't reliably spell the query term. Another use of the replacement set is for product name changes where documents referring to the old product line are no longer needed, or where a person's name has changed.
Your code would look like this:
<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <replacement>
            <pat>book writer</pat>
            <sub>author</sub>
            <sub>wordsmith</sub>
        </replacement>
    </thesaurus>
</XML>
Creating replacement sets for each misspelling is more time consuming, but it is also more accurate and helps those who are “spelling-challenged” to get a better result set.
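Following the chrysanthemum example, a replacement set covering misspellings uses several <pat> elements and a single correctly spelled <sub>, the same multiple-pattern shape the default file uses for NT5 and W2K. The misspellings shown are illustrative:

```xml
<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <replacement>
            <pat>chrysanthemem</pat>
            <pat>chrisanthemum</pat>
            <pat>crysanthemum</pat>
            <sub>chrysanthemum</sub>
        </replacement>
    </thesaurus>
</XML>
```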
So how can you use this? Let’s assume that you have a product-line name change. Use an expansion set to expand searches on the old and new names if both sets of documents are relevant after the name change. If the documents referring to the old name are not relevant, use a replacement set to replace queries on the old name with the new name.
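To make the product-rename guidance concrete, suppose a product named "Fabrikam Classic" is renamed "Fabrikam One" (hypothetical names). A sketch of both approaches follows; you would use one or the other, not both, for the same terms:

```xml
<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <!-- Old documents still relevant: expand a query on either name to both -->
        <expansion>
            <sub>Fabrikam Classic</sub>
            <sub>Fabrikam One</sub>
        </expansion>
        <!-- Old documents obsolete: instead, rewrite queries on the old name
        <replacement>
            <pat>Fabrikam Classic</pat>
            <sub>Fabrikam One</sub>
        </replacement>
        -->
    </thesaurus>
</XML>
```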
Noise Word File
The noise word file, by default, contains prepositions, adjectives, adverbs, articles, personal pronouns, single letters, and single numbers. You'll want to add to the noise word file any additional words that hold no discriminatory value in your environment. Further examples of words that don't discriminate well between documents include your company name and the names of individuals who appear often in documents or Web pages.
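The noise word file is a plain list, one word per line. For English builds the file is commonly named noiseenu.txt and sits alongside the thesaurus files; treat the exact name and location as an assumption to verify on your installation. A sketch of a customized file, with "contoso" standing in for your own company name:

```
a
an
the
about
contoso
```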
Keywords are words or phrases that site administrators have identified as important. They provide a way to display additional information and recommended links on the initial results page that might not otherwise appear in the search results for a particular word or phrase. Keywords are a method to immediately elevate a content item to prominence in the result set simply by associating a keyword with the content item. The content item, in this context, is considered a Best Bet. Best Bets are items that you want to appear at the top of the result set regardless of what other content items appear in the result set. For example, you could make the URL to the human resource policy manual a Best Bet so that anytime a user queries “human resources”, the link to the policy manual appears at the top of the result set.
Keywords are implemented at the site-collection level. You’ll create a keyword and then give it one or more synonyms. As part of creating the keyword, you’ll need to enter at least one Best Bet. After you create the keyword, add the synonym. After associating at least one Best Bet, you’ll find that when you search on the synonym, the Best Bet will appear in the Best Bet Web Part in the right-hand portion of the results page (by default).
Here is an example. Create a keyword by clicking the Search Keywords link in the Site Collection Administration menu. Then click the Add Keyword button. In the Keyword Phrase text box, type Green, and in the Synonym text box, type Color. Next, associate the keyword and the synonym with the Green folder in the Docs share as a Best Bet. (See Figure 16-23.)
After doing this, when you search on the word “color,” you see the Green folder appear in the Best Bet Web Part to the right of the core result set. (See Figure 16-24.)
Remember, Best Bets are merely a link to the information that is especially relevant to the keyword or its synonym. Be sure to look through your reports to find terms that are being queried many times, with users going to the same location many times. This is an indication that these terms can be grouped into a keyword with synonyms and these destinations become the Best Bet.
Working with the Result Set
The result set can be modified and managed in a number of ways. It will be impossible to fully cover every aspect of each Web Part in the result set, so this section highlights some of the more important configurations.
First, it is possible that there will be times when a user queries to find a document and receives a separate listing in the result set for each document in the version history of a document lineage. If this happens, try crawling the document library using an account with read permissions to the document library that isn’t the same as the application pool account.
The application pool account has pervasive access to content in a way that is not displayed in the user interface and is not configurable by the site administrator. Regardless of the type of versioning that is turned on, if you crawl that library using the application pool account, all versions in the history of that document will be crawled and indexed and may be displayed in the result set for your users.
Second, it is important for search administrators (and anyone modifying the results page) to grasp that a major portion of the configuration is pushed into the Web Part properties rather than exposed as links on an Administration menu page. This makes it a bit more difficult to remember where to go when trying to manage an individual element on the page. Just remember, you're really trying to manage the Web Part, not the page itself. You modify Web Parts by clicking Edit Page under Site Actions. The page will immediately be placed into edit mode. Your actions from this point forward will not be seen by your users until you successfully publish the page.
The first thing that you’ll notice you can do is add more tabs across the top of the page. If you click on the Add New Tab link (shown in Figure 16-25), you’re taken to the Tabs In Search Results: New Item page. On this page, you can reference an existing page, enter a new tab name for that page, and enter a tooltip that will pop up when users hover their mouse over the tab (as shown in Figure 16-26). Remember, you’re not creating a new Web page at this location. You’re merely referencing a page that you’ve already created in the Pages library. You do this to map a new search results page with one or more search scopes.
Receiving Notifications from Search Results
When users receive a result set from Search, they have the ability to continue receiving notifications from search results based on their query. If they like the results of the query they've executed and expect to re-execute the query multiple times to stay informed about new or modified information that matches the query, users can choose either to create an alert based on their query or to set up a Really Simple Syndication (RSS) feed for the query.
By using RSS, users can stay updated about new information in individual lists or libraries. The RSS feed will be automatically added to their Outlook 2007 client. (Earlier versions of Outlook do not support this feature.) In addition, to run the RSS client successfully, users will need to download the Microsoft desktop search engine. The desktop search engine will not be automatically installed: you need to install it manually.
If you need to remove the RSS link feature, first generate a result set and then under the Site Actions menu, click Edit Page. Navigate to the Search Actions Link Web Part, click Edit, and then click Modify Shared Web Part. Under the Search Results Action Links list, clear the Display “RSS” Link check box.
Best Practices Clear the Display “RSS” Link check box until you have deployed the Microsoft Desktop Search Engine. There is no sense in giving your users an option in the interface if they can’t use it.
Customizing the Search Results Page
By default, the following Web Parts are on the Search Center’s results page:
Search Action Links
Search Best Bets
Search High Confidence Results
Search Core Results
This section covers the management of the more important Web Parts individually. All of these Web Parts are managed by clicking the Edit button in the Web Part (refer back to Figure 16-25) and then selecting Modify Shared Web Part in the drop-down menu list.
Search Box
The Search Box Web Part allows you to select whether or not scopes appear in the drop-down list when executing a query. The Dropdown Mode drop-down list in the Scope Dropdown section offers you several choices. You can completely turn off scopes or ensure that contextual scopes either appear or don't appear. (See Figure 16-27.)
The "s" parameter can be used when the scopes drop-down list is hidden. If needed, you can pass the scope directly in the URL through this parameter. When the drop-down is hidden, the "s" parameter lets you indicate whether the search box should default to the contextual scope (for example, This Site) or reuse the scope specified by the search box on the originating page, which is passed along in the parameter.
This is the mode used within the search center, allowing for the search scope to be carried through across tabs. It allows you to construct a user interface where some tabs make use of the ‘s’ parameter and others don’t, but the parameter is preserved as the user navigates through the tabs.
For example, let's assume a user picks the scope "Northwest." After executing the query, the user is presented with the Northwest results on the search results page in the Search Center. The search box on that page displays no scopes drop-down list. When the user modifies the query slightly and re-executes the search, the search box reuses the same scope for the second query because the scope is specified in the "s" parameter.
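The mechanics are easiest to see in the query URL itself. A results-page URL shaped like the following (the server name, Search Center path, and scope name are placeholders) carries the query text in the "k" parameter and the scope in the "s" parameter:

```
http://portal/SearchCenter/Pages/results.aspx?k=muddy%20boots&s=Northwest
```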
In addition, you can enter a label to the left of the scope drop-down list that explains what scopes are, how the drop-down list works, or what each scope is focused on.
Search Core Results
The Search Core Results Web Part displays the result set. The rest of the Web Parts on the page can be considered helpful and supportive to the Search Core Results Web Part. In the configuration options of this Web Part, you can specify fixed keyword queries that are executed each time a query is run along with how to display the results and query results options.
Probably of most importance are the query results options. In this part of the configuration options (shown in Figure 16-28), you have the following choices to make:
Remove Duplicate Results Select this option if you want to ensure that duplicate entries for the same content item are removed from the result set.
Enable Search Term Stemming By default, stemming in the result set is turned off, even though the indexer performs stemming on words as they enter the index. For example, if the crawler crawls the word "buy," the indexer will stem the word and include "buy," "buying," "buys," and "bought." But in the result set, by default, if the query term is "buy," the result set will display only content items that contain the word "buy." If you enable search term stemming, then, in this scenario, the result set will contain content items that include the stemmed forms of "buy." Enable this option if you want the result set to include content items that might contain only stemmed forms of the query terms.
Permit Noise Word Queries This feature allows noise words in one language to be queried in another language when working in a cross-lingual environment. For example, "the" is a noise word in English, but its equivalent in another language might not be a noise word. Enabling this option means that a user who searches on "the" will obtain content items in other languages where "the" is not considered a noise word.
Enable URL Smashing With this feature, the query terms are "smashed" together, and Search then checks whether a URL matches the smashed query terms exactly. For example, if someone searches on "campus maps" and there is an intranet Web site with the URL http://campusmaps, this URL will become the first result in the result set. This differs from URL matching in that the smashed query terms must match the URL exactly, whereas in URL matching the query terms need to match only a portion of the URL.
Search results frequently contain several items that are the same or very similar. If these duplicated or similar items are ranked highly and returned as the top items in the result set, other results that might be more relevant to the user appear much further down in the list. This can create a scenario where users have to page through several redundant results before finding what they are looking for.
Results collapsing can group similar results together so that they are displayed as one entry in the search result set. This entry includes a link to display the expanded results for that collapsed result set entry. Search administrators can collapse results for the following content item groups:
Duplicates and derivatives of documents
Windows SharePoint Services discussion messages for the same topic
Microsoft Exchange Server public folder messages for the same conversation topic
Current versions of the same document
Different language versions of the same document
Content from the same site
By default, results collapsing is turned on in Office SharePoint Server 2007 Search. Results collapsing has the following characteristics when turned on:
Duplicate and Derivative Results Collapsing When there are duplicates in the search results and these are collapsed, the result that is displayed in the main result set is the content item that is the most relevant to the user’s search query. With duplicated documents, factors other than content will affect relevance, such as where the document is located or how many times it is linked to.
Site Results Collapsing When search results are collapsed by site, results from that site are collapsed and displayed in the main result set in one of two ways: content from the same site is grouped together, or content from the same folder within a particular site is grouped together, depending on the type of site (as described in the following sections).
SharePoint Sites All results from the same SharePoint site are collapsed. No more than two results from the same SharePoint site will be displayed in the main results set.
Sites Other than SharePoint Sites For Web sites that are not SharePoint sites, results are collapsed based on the folder. Results from the same site but from different folders within that site are not grouped together in the collapsed results. Only results from the same folder are collapsed together. No more than two results from the same folder for a particular site will be displayed in the main results set.
Finding People in the Search Center
Much of your organization’s most important information is not found in documents or Web sites. Instead, it is found in people. As users in your organization learn to expose (or “surface”) information about themselves, other users will be able to find them based on a number of metadata elements, such as department, title, skills, responsibilities, memberships, or a combination of those factors.
Use the People tab (shown in Figure 16-29) to execute queries for people. When you do this, the result set will list the public-facing My Sites for users whose names match the search query. In addition, the result set will default to sorting by social distance, so the My Colleagues Web Part will appear listing the users who fit the search query and who are also part of your colleague (or social) network.
If you want to view these results by relevance rather than by social distance, click the View By Relevance link and the result set will be displayed in rank order. If you want to change the default ordering of people to display by rank rather than by social distance, change the Default Results View setting in the Modify Shared Web Part properties of the People Search Core Results Web Part.
It is difficult to read a chapter and be expected to remember all the best practices for implementing and administrating a feature set like this. This sidebar summarizes some of what I’ve learned over the last three years regarding search and indexing.
I’ll start by mentioning some planning elements that often crop up when I work with customers in the field. After customers learn about the breadth and depth of what search and indexing can do, they realize that they need to take a long step back and ask two important questions:
Where is our information?
Which information do we want to crawl and index?
These are not inconsequential questions. One of the main reasons for implementing a software package like Office SharePoint Server 2007 is to aggregate your content. If you don’t know where your content is, how can you aggregate it? The reason, often, that we don’t know where our content resides is because it resides in so many different locations. Think about it. In most organizations, content is held all over the place, such as in the following places:
The exciting element about search and indexing in Office SharePoint Server 2007 is the existence of the Business Data Catalog and its ability to act as an abstraction layer between Office SharePoint and the different data interfaces so that we can finally aggregate content using Office SharePoint Server 2007. But if you haven't taken the time to genuinely understand where your data resides, you'll have a hard time knowing what data should be aggregated using search and indexing.
However, once you know where your data resides and you've decided which data will be indexed, you can begin building your content source structure in a way that makes sense. Most organizations have found that they can't "swerve" into success by creating content sources as the need arises. Instead, it is very helpful to put some forethought and planning into determining which content in the environment will be aggregated using search and indexing, as opposed to aggregating the content by either hosting it in a SharePoint site (which is automatically indexed) or linking to the content (which might or might not index the content).
So, as you start your planning process, be sure to ask yourself some key questions:
Asking yourself questions like this will help you avoid mistakes when implementing a robust search and indexing topology. Some best practices to keep in mind include the following:
In larger organizations or in organizations with heavy search and indexing needs (say, 75 or 100 content sources or more), you might find that you need a full-time staff member just to manage this area. Consider your staffing needs as your content source topology grows and becomes more complex.
And this was just the overview! As you can see, search and indexing is a deep, wide technology that can provide significant value in your organization. In this chapter, you looked at the Search architecture, the many facets of search administration, and how to work and customize the Search Center and its results page.
The next chapter covers search deployment along with using Search as a feature. Much of the discussion of the Search functionality is behind you now, so it’s time to move on to see how you can implement Search in your enterprise.