Microsoft SharePoint Portal Server 2001 Resource Kit

The Category Assistant provides an automatic process to categorize documents quickly and efficiently regardless of their storage location. With the Category Assistant, employees can use the company vocabulary to structure existing content for portal users.

This chapter describes how Cairn Energy used the Category Assistant to categorize corporate documents.

Cairn Energy faced two challenges when it came to categorizing documents in Microsoft® SharePoint™ Portal Server 2001:

  • Manually categorizing large numbers of documents is time consuming because someone has to go through each document. Cairn Energy did not want to burden an internal resource in this way. 

  • Manually categorizing any number of external documents is not possible because you cannot check in external documents. Cairn Energy did not want to populate the dashboard site at the beginning of the project with large numbers of documents. 

About Cairn Energy

Cairn Energy PLC is an oil and gas exploration company with offices in Scotland, the Netherlands, India, and Bangladesh. Information the company relies on is spread across the world and key decision makers need access to this information wherever they are located. Drilling and exploration operations generate large amounts of information that needs to be distributed, analyzed, and preserved. High-value investment decisions often need to be made with short notice. These decisions can only be made through collaboration between globally distributed teams. These decisions rely in part on access to the intellectual property amassed from previous exploration projects. Moreover, the results of these decisions frequently demand the rapid provision of corporate information services to remote locations. Therefore, Cairn Energy PLC requires a flexible infrastructure to facilitate the effective capture of information and its subsequent organization and distribution.

Automatic Categorization

Automatic categorization with the Category Assistant is composed of two parts. First, you must "train" the Category Assistant to recognize documents belonging to particular categories. Second, SharePoint Portal Server crawls documents for inclusion in an index. During this latter process, SharePoint Portal Server associates documents with categories based on information from the training documents.

When using the Category Assistant, consider the following factors:

  • You must provide the Category Assistant with training documents. To do this, you check documents in to the workspace. During check-in, you specify at least one category. After check-in, you must publish the documents before they can be included in an index. You can also include external documents in the set of training documents by creating Web links in the workspace to the documents. 

  • You can use the Category Assistant to categorize any document included in the index no matter where it resides. SharePoint Portal Server creates an index of searchable information that includes all workspace content. It can also include a variety of information stored outside the workspace on other SharePoint Portal Server workspaces, Web sites, file systems, Microsoft Exchange Servers, and Lotus Notes databases. When indexed, documents are categorized by the Category Assistant. The Category Assistant's precision can be controlled so that more or fewer documents are categorized. 

After evaluating the Category Assistant in a test environment, the Cairn Energy project team decided that careful planning was needed to ensure the most accurate results. The following sections outline the planning process and describe how the team developed training documents.

Selecting Training Documents

The project team attributes the quality of categorization achieved by the Category Assistant to the quality of the training documents used. Although finding good training documents was time consuming, it greatly reduced the amount of time spent categorizing documents overall.

The team found that understanding how the Category Assistant learns was useful when choosing training documents. The training process builds a list of definitive terms for each category by comparing training documents in a single category to those in other categories. The Category Assistant identifies the top 300 shared features among the training documents for a category. The Category Assistant then applies an algorithm to all documents included in the index to determine the proposed category membership.
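The chapter does not specify the algorithm the Category Assistant uses, so the following Python sketch is only an illustration of the general idea of comparing one category's training documents with those of other categories. The scoring rule, the function names, and the tokenizer are assumptions made for this example; only the figure of 300 features comes from the text above.

```python
from collections import Counter
import re

TOP_FEATURES = 300  # the chapter mentions the top 300 features per category


def tokenize(text):
    """Lowercase a document and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def category_features(training_docs, top_n=TOP_FEATURES):
    """Return, per category, the terms most characteristic of that category.

    training_docs maps a category name to a list of document strings.
    Assumed scoring rule: a term scores highly when it appears in many
    documents of the category and in few documents of other categories.
    """
    # Number of documents in each category that contain each term.
    doc_freq = {
        cat: Counter(term for doc in docs for term in set(tokenize(doc)))
        for cat, docs in training_docs.items()
    }
    features = {}
    for cat, docs in training_docs.items():
        other_total = sum(len(d) for c, d in training_docs.items() if c != cat)
        scores = {}
        for term, count in doc_freq[cat].items():
            inside = count / len(docs)
            outside_count = sum(doc_freq[c][term] for c in doc_freq if c != cat)
            outside = outside_count / other_total if other_total else 0.0
            scores[term] = inside - outside
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        features[cat] = ranked[:top_n]
    return features


training = {
    "Finance": ["invoice budget report", "budget invoice total"],
    "HSE": ["safety report incident", "safety audit incident"],
}
features = category_features(training, top_n=2)
```

In this toy data the word "report" appears in both categories, so it scores low for each. This mirrors why documents that stay on the category subject throughout make better training material.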

The project team chose ten categories for training from the overall list of categories. Each category included a minimum of 20 training documents.

The project team found the following points useful for selecting training documents:

  • Explain how the Category Assistant works to the people supplying the training documents. 

  • Create at least ten categories for the Category Assistant to learn. 

  • Use training documents that contain a minimum of 2,000 words each. 

  • Choose documents with a large number of words per file. For example, Microsoft Excel spreadsheets and Microsoft PowerPoint® presentations frequently did not make good training documents. Files with high word counts, such as Microsoft Word documents, Adobe Acrobat files, and Tagged Image File Format (TIFF) files, made good training documents. 

  • Use training documents that represent a broad range of examples from the subject category. 

  • Use training documents that cover the category subject throughout the document. Even though documents may start on the subject, if much of the content is not relevant, this lowers accuracy. 

  • You can use training documents that belong to multiple categories. 
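The numeric guidelines above can be checked before training begins. The following Python sketch is a hypothetical helper, not part of SharePoint Portal Server. It assumes that you have already counted the words in each candidate document, and it treats the 20-document minimum as Cairn Energy's own figure rather than a product requirement.

```python
MIN_CATEGORIES = 10          # "Create at least ten categories"
MIN_WORDS_PER_DOC = 2000     # "a minimum of 2,000 words each"
MIN_DOCS_PER_CATEGORY = 20   # Cairn Energy's own figure, not a product limit


def check_training_set(word_counts):
    """Return a list of warnings for a proposed training set.

    word_counts maps a category name to a list of per-document word counts.
    An empty list means the set meets all three guidelines.
    """
    warnings = []
    if len(word_counts) < MIN_CATEGORIES:
        warnings.append(
            f"only {len(word_counts)} categories; aim for {MIN_CATEGORIES}"
        )
    for category, counts in sorted(word_counts.items()):
        if len(counts) < MIN_DOCS_PER_CATEGORY:
            warnings.append(
                f"{category}: {len(counts)} documents; "
                f"aim for {MIN_DOCS_PER_CATEGORY}"
            )
        short = sum(1 for n in counts if n < MIN_WORDS_PER_DOC)
        if short:
            warnings.append(
                f"{category}: {short} documents under "
                f"{MIN_WORDS_PER_DOC} words"
            )
    return warnings


# Example: one category with two documents, one of them too short.
for warning in check_training_set({"HSE": [100, 3000]}):
    print(warning)
```

The check covers only the countable guidelines. Whether a document is representative of its category still needs a person's judgment.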

Adding Training Documents

The team found that dragging documents into the workspace by using Web folders was the most efficient way to add documents. The team created document profiles that included the Categories attribute. After they added documents to the workspace, they checked them in and published them using the new document profile. The team used these documents as representative samples for training the Category Assistant.

Cairn Energy found that adding training documents individually was time consuming. With SharePoint Portal Server, you can check in multiple documents at once to speed the process. SharePoint Portal Server includes only published documents in the index, so make sure that the documents used for training are approved and published before you run the Category Assistant.

If you do not want to add a document to the workspace but you do want to use it as a training document, you can create a Web link that points to the external document. To do this, create a blank document in the workspace to represent the external document and apply the Web link document profile. You can add the Categories attribute to the Web link profile. When you check in and publish the document in the workspace, you can assign the appropriate category to it. When SharePoint Portal Server crawls the workspace, it also crawls the URL associated with the Web link and the metadata included on the document profile. This includes the external document in the index but leaves it in its original location. By following this process, you can include external documents in the set of training documents.

The team developed the following process:

  • Add multiple documents to the workspace by using Web folders. 

  • Categorize documents in the workspace by using the document profile. 

  • Add Web links in the workspace to external documents. 

  • Categorize the external documents by using the Web Link document profile. 

After completing these steps, you can begin training the Category Assistant.

Training the Category Assistant

To categorize documents automatically, you must complete two tasks. First, train the Category Assistant with a set of documents that represent your categories. Second, apply the newly learned categories to all the documents included in the index. You can train the Category Assistant first and then schedule SharePoint Portal Server to perform a full crawl at the next appropriate time. At Cairn Energy, the team found this useful because categorization and crawling affect overall performance of SharePoint Portal Server.

Note: To access the Category Assistant, right-click the Categories folder in the workspace, and then click Properties.

Monitoring Training

You can monitor the training process by using the Microsoft Windows® 2000 Event Viewer Log.

If insufficient training documents are available, SharePoint Portal Server generates an error message in the Application log as MSSearch Gatherer Event 3065 workspace name_train$$$ Catalog. 

When you initiate a training session, SharePoint Portal Server enters a message in the Application log as MSSearch Gatherer Event 3035 workspace name_train$$$ Catalog. 

Upon successful completion of the training, SharePoint Portal Server generates a message in the Application log as MSSearch Gatherer Event 3018 workspace name_train$$$ Catalog. 
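The three event IDs above can be turned into a simple status summary. The following Python sketch assumes that you have already exported the MSSearch Gatherer event IDs from the Application log, oldest first. Reading the log itself is not shown, and the function name and status strings are hypothetical.

```python
# Event IDs as described in this chapter for the training catalog.
TRAINING_EVENTS = {
    3035: "training started",
    3018: "training completed successfully",
    3065: "error: insufficient training documents",
}


def training_status(event_ids):
    """Summarize training progress from event IDs listed oldest first.

    The most recent training-related event determines the status.
    Event IDs that are not in TRAINING_EVENTS are ignored.
    """
    status = "no training events found"
    for event_id in event_ids:
        if event_id in TRAINING_EVENTS:
            status = TRAINING_EVENTS[event_id]
    return status


print(training_status([1000, 3035, 3018]))
```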

All documents that you categorize using the document profile during check-in are potential training documents. It is important to maintain the accuracy of categories applied to documents in this way. If you retrain the Category Assistant by using poor quality training documents, you reduce the accuracy of the automatic categorization that it performs. Cairn Energy trained the Category Assistant only when it was certain that the training documents were of high quality.

Categorizing Documents

After you complete the training, you can manually start the crawl process so that SharePoint Portal Server includes the documents in the index. Alternatively, you can defer the crawl until the next scheduled time. Each time SharePoint Portal Server performs a full update, it also categorizes the documents included in the index.

You can limit the documents you automatically categorize to documents stored in the workspace, or you can choose to include documents stored outside the workspace. Cairn Energy automatically categorized all documents regardless of their location.

The team initially set the Category Assistant to "High Precision" when training it. You can update the index by using the same training documents. Cairn Energy found that reducing the precision increases the number of documents suggested by the Category Assistant. In addition, the quality of the training documents affects the accuracy of the suggested categories. Cairn Energy experimented with reducing precision, but the team decided that including fewer documents with higher accuracy was more suitable to their deployment.
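The trade-off between precision and the number of suggested documents can be pictured as a score threshold. The chapter does not say that the Category Assistant exposes numeric scores, so the scores, file names, and threshold values in the following Python sketch are invented purely for illustration.

```python
def suggest(scores, threshold):
    """Return the documents whose category score meets the threshold."""
    return sorted(doc for doc, score in scores.items() if score >= threshold)


# Hypothetical confidence scores for one category.
scores = {"po-1041.doc": 0.93, "budget.xls": 0.71, "memo.doc": 0.48}

# A high threshold suggests fewer documents, each with more confidence.
high_precision = suggest(scores, threshold=0.9)

# A lower threshold suggests more documents, including weaker matches.
low_precision = suggest(scores, threshold=0.4)

print(len(high_precision), len(low_precision))
```

Under these assumptions, lowering the threshold can only add documents to the suggested set. This is consistent with the team's observation that reducing precision increases the number of suggestions.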

After SharePoint Portal Server categorizes a document, you can view the proposed category structure.

To view the proposed category structure:
  1. In Web folders, right-click the document you want to view, and then click Properties. 

  2. Click the Search and Categories tab. 

The categories are listed under Categories suggested by the Category Assistant. SharePoint Portal Server displays documents from external content sources in the Categories folder in the workspace.

If the Category Assistant generates inaccurate results, you can override the suggested categories for documents stored within the workspace.

To override the categories suggested by the Category Assistant:
  1. Open the properties page for the document for which you want to override the suggested categories. 

  2. On the Search and Categories tab, clear the Show Suggested Categories check box. 

You cannot override this setting for documents stored outside of the workspace. Therefore, it is very important to use good training documents to achieve the highest possible levels of accuracy.

Results

Cairn Energy measured the accuracy of the Category Assistant by looking at the documents displayed in the Categories folder and assessing their relevance.

The following section summarizes the deployment environment and the results from the Category Assistant. The project team estimates the Category Assistant to be about 90 percent accurate. The Category Assistant retained accuracy after the initial testing and categorization phase into the second phase. The team automatically categorized only a small proportion of the overall company documents. With the addition of further training documents, it is expected that the Category Assistant will continue with the same level of accuracy.

Index Statistics

The following table provides information about the documents included in the index for Cairn Energy.

Index Statistics

Sources         Number of Documents   Size                 Health
Intranet        2,315                 500 megabytes (MB)   99 percent
Public folder   5,477                 500 MB               99 percent
Share one       4,295                 500 MB               99 percent
Share two       9,986                 4 GB                 98 percent
Share three     60,765                16 GB                97 percent

Document Types

Document types for the Cairn Energy deployment break into the following percentages:

  • 70 percent Word 

  • 15 percent Excel 

  • 5 percent Portable Document Format (PDF) 

  • 5 percent PowerPoint 

  • 5 percent TIFF 

All documents are in English.

Category Assistant Configuration

The project team set the Category Assistant precision level to High Precision. They also enabled the Category Assistant to categorize all documents, regardless of their storage location.

Category Assistant Results

The following table summarizes the results after training the Category Assistant. The first update shows the number of training documents and the suggested category results after updating the index the first time. The second update shows the results after updating the index a second time.

Results of the Category Assistant

                      First update               Second update
Training Categories   Training     Suggested     Training     Suggested
                      documents    documents     documents    documents
Finance               21           303           48           3,712
HSE                   24           62            24           62
IM                    20           50            39           140
Procurement           20           1,920         20           1,920
Risk                  21           128           21           128
Procedures            0            0             27           176
Asset1                10           25            10           25
Asset2                10           39            15           101
Asset3                0            11            24           11
Asset4                23           105           30           170
Asset5                0            0             30           597

The Procurement and Finance categories present the most impressive results. The Category Assistant identified documents as purchase orders and categorized them all appropriately. With the Category Assistant set to "High Precision," categorization was highly accurate and resulted in very few ineffective results.
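To put the two tables in proportion, the figures can be totaled. The following Python sketch only adds up the numbers published in this chapter. The variable names are arbitrary, and the final ratio is an upper bound on the share of distinct documents because a single document can be suggested for more than one category.

```python
# Rows from the index statistics table: source -> number of documents.
indexed = {
    "Intranet": 2315,
    "Public folder": 5477,
    "Share one": 4295,
    "Share two": 9986,
    "Share three": 60765,
}

# Rows from the results table: category ->
# (training docs, suggested docs) for the first update, then the same
# pair for the second update.
results = {
    "Finance": (21, 303, 48, 3712),
    "HSE": (24, 62, 24, 62),
    "IM": (20, 50, 39, 140),
    "Procurement": (20, 1920, 20, 1920),
    "Risk": (21, 128, 21, 128),
    "Procedures": (0, 0, 27, 176),
    "Asset1": (10, 25, 10, 25),
    "Asset2": (10, 39, 15, 101),
    "Asset3": (0, 11, 24, 11),
    "Asset4": (23, 105, 30, 170),
    "Asset5": (0, 0, 30, 597),
}

total_indexed = sum(indexed.values())
first_suggested = sum(row[1] for row in results.values())
second_suggested = sum(row[3] for row in results.values())
share = second_suggested / total_indexed

print(f"indexed documents:         {total_indexed:,}")
print(f"suggestions, first update:  {first_suggested:,}")
print(f"suggestions, second update: {second_suggested:,}")
print(f"suggestions per indexed document: {share:.1%}")
```

On these figures the second update produced suggestions equal to about 8.5 percent of the indexed documents, which is in line with the earlier statement that only a small proportion of company documents was categorized automatically.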

Summary

The Category Assistant is an extremely valuable tool for rapidly applying structure to information when deploying SharePoint Portal Server. It is essential that you plan effectively and identify suitable categories and representative training documents.
