About content sources (Office SharePoint Server 2007)

01/10/2017

Applies To: Office SharePoint Server 2007

This Office product will reach end of support on October 10, 2017. To stay supported, you will need to upgrade. For more information, see Resources to help you upgrade your Office 2007 servers and clients.

Topic Last Modified: 2016-11-14

Content is any item that can be crawled, such as a Web page, a Microsoft Office Word document, business data, or an e-mail message. Content resides in a content repository, such as a Web site, file share, or SharePoint site. A content source specifies settings that define how and on what schedule content is crawled. It includes one or more addresses of a content repository from which to start crawling, also called start addresses. These settings apply to all start addresses within the entire content source.

Default content source

If your organization has to crawl only the content in the SharePoint sites in your server farm, you might not have to create an additional content source. Microsoft Office SharePoint Server 2007 defines a default content source during its initial deployment. The default content source is named Local Office SharePoint Server sites. The start addresses of all Web applications in the server farm are automatically included as part of the default content source. By default, this content source is not crawled. To index the content in the default content source, you have to either manually start or schedule crawls for it.

Creating a new content source

When you create a content source, you specify settings that define the kind of content it crawls, when the content is crawled, and crawling behavior, such as how deep to crawl within the namespace of the start address or how many server hops to allow. If you have multiple kinds of content repositories that you want to crawl, or you want to crawl some content repositories on different schedules, you have to create additional content sources. Office SharePoint Server 2007 can support up to 500 content sources per Shared Service Provider (SSP), and each content source can contain up to 500 start addresses. For more information about when to create additional content sources, see the “Plan content sources” section of Plan to crawl content (Office SharePoint Server). For more information about how to configure crawling behavior, see Limit or increase the quantity of content that is crawled (Office SharePoint Server).
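The two limits described above can be pictured with a small model. The following Python sketch is illustrative only and is not the Office SharePoint Server 2007 object model: the class names ContentSource and SharedServiceProvider and their methods are hypothetical, and only the two limits of 500 are taken from this article.

```python
from dataclasses import dataclass, field

# Limits stated in this article.
MAX_CONTENT_SOURCES_PER_SSP = 500
MAX_START_ADDRESSES_PER_SOURCE = 500


@dataclass
class ContentSource:
    """Hypothetical stand-in for a content source and its start addresses."""
    name: str
    start_addresses: list = field(default_factory=list)

    def add_start_address(self, address):
        # Refuse the address once the per-source limit is reached.
        if len(self.start_addresses) >= MAX_START_ADDRESSES_PER_SOURCE:
            raise ValueError(
                f"'{self.name}' already has {MAX_START_ADDRESSES_PER_SOURCE} start addresses"
            )
        self.start_addresses.append(address)


@dataclass
class SharedServiceProvider:
    """Hypothetical stand-in for an SSP that owns content sources."""
    name: str
    content_sources: dict = field(default_factory=dict)

    def create_content_source(self, source_name):
        if source_name in self.content_sources:
            raise ValueError(f"content source '{source_name}' already exists")
        # Refuse the new source once the per-SSP limit is reached.
        if len(self.content_sources) >= MAX_CONTENT_SOURCES_PER_SSP:
            raise ValueError(
                f"'{self.name}' already has {MAX_CONTENT_SOURCES_PER_SSP} content sources"
            )
        source = ContentSource(source_name)
        self.content_sources[source_name] = source
        return source
```

A planning script could use a model like this to check a proposed list of content sources against the limits before anyone configures them on the server.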

Types of content repositories

You can crawl only one kind of content repository per content source. That is, you can create one content source that contains URLs for SharePoint sites and another that contains URLs for file shares, but you cannot create a single content source that contains URLs for both SharePoint sites and file shares.

The following table lists the kinds of content repositories that Office SharePoint Server 2007 can crawl:

This kind of content source	Includes this kind of content
SharePoint sites	SharePoint sites from the same farm or from different Microsoft Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Microsoft Search Server 2008 farms; SharePoint sites from Microsoft Office SharePoint Portal Server 2003 or Microsoft Windows SharePoint Services 2.0 farms. Note: The Office SharePoint Server 2007 crawler can automatically crawl all Office SharePoint Server 2007, Windows SharePoint Services 3.0, and Search Server 2008 sites and subsites. To crawl previous versions of SharePoint products and technologies, you must specify the URL of each top-level site (site collection) and each subsite that you want to crawl. Sites listed in the Site Directory of Microsoft Office SharePoint Portal Server 2003 farms are crawled when the portal site is crawled. For more information about the Site Directory, see About the Site Directory (https://go.microsoft.com/fwlink/?LinkId=88227).
Web sites	Web content within your organization that is not found on SharePoint sites; content on Web sites on the Internet. Note: The crawler behaves in the same manner whether it uses the Web sites content source type or the SharePoint sites content source type. Only the crawl settings that you can configure for these content source types differ.
File shares	Content on file shares within your organization
Exchange public folders	Microsoft Exchange Server content
Lotus Notes	Content stored in Lotus Notes databases. Note: The Lotus Notes content source option does not appear in the user interface until you have configured the index server to work with Lotus Notes. For more information, see Prepare to crawl Lotus Notes (Office SharePoint Server 2007).
Business Data (Enterprise Edition only)	Business data that is stored in line-of-business applications. You can create one content source to crawl all applications registered in the Business Data Catalog, or you can create separate content sources to crawl individual applications that are registered in the Business Data Catalog. Before you create a content source for business data, you must register the applications hosting the data in the Business Data Catalog. For more information, see Register business applications in the Business Data Catalog.
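
The rule of one kind of content repository per content source can be sketched as a simple check. The following Python example is an illustration under stated assumptions, not a product API: RepositoryKind and TypedContentSource are hypothetical names, the enumeration values mirror the first column of the table above, and the kind of each start address is declared by the caller rather than detected.

```python
from enum import Enum


class RepositoryKind(Enum):
    """Kinds of content source listed in the table above."""
    SHAREPOINT_SITES = "SharePoint sites"
    WEB_SITES = "Web sites"
    FILE_SHARES = "File shares"
    EXCHANGE_PUBLIC_FOLDERS = "Exchange public folders"
    LOTUS_NOTES = "Lotus Notes"
    BUSINESS_DATA = "Business Data"


class TypedContentSource:
    """Hypothetical content source that accepts one repository kind only."""

    def __init__(self, name, kind):
        self.name = name
        self.kind = kind
        self.start_addresses = []

    def add_start_address(self, address, kind):
        # Reject an address whose declared kind differs from the source's kind.
        if kind is not self.kind:
            raise ValueError(
                f"'{self.name}' crawls {self.kind.value}; "
                f"cannot add a start address of kind {kind.value}"
            )
        self.start_addresses.append(address)
```

In this sketch, a file share address offered to a SharePoint sites content source raises an error, so the caller has to create a second content source for it.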

Start address of content

Each content source contains a list of start addresses that the crawler uses to connect to the repository of content. Each content source can contain up to 500 start addresses. You cannot crawl the same address using multiple content sources. For example, if you use a particular content source to crawl a site collection and all its subsites, you cannot use a different content source to crawl one of those subsites on a different schedule.
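
When you plan several content sources, it can help to check the proposed start addresses for overlap before you create them. The following Python sketch is one way to do that, under assumptions that do not come from the product: the function names are hypothetical, addresses are assumed to be HTTP or HTTPS URLs, comparison is case-insensitive, and an address is treated as covered when it equals another start address or lies beneath it in the URL path. Whether the crawler would actually reach a nested address also depends on the crawl settings of the content source, which this sketch ignores.

```python
from urllib.parse import urlsplit


def normalize(address):
    """Reduce a URL to (scheme, host, path) in lowercase, without a trailing slash."""
    parts = urlsplit(address.strip())
    return (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/").lower())


def covers(parent, child):
    """Return True if child equals parent or sits below it on the same host."""
    p, c = normalize(parent), normalize(child)
    if p[:2] != c[:2]:
        return False
    return c[2] == p[2] or c[2].startswith(p[2] + "/")


def find_conflicts(sources):
    """List start addresses that overlap across different content sources.

    sources maps a content source name to its list of start addresses.
    Each conflict is a tuple (name_a, address_a, name_b, address_b).
    """
    items = [(name, addr) for name, addrs in sources.items() for addr in addrs]
    conflicts = []
    for i, (name_a, addr_a) in enumerate(items):
        for name_b, addr_b in items[i + 1:]:
            if name_a == name_b:
                continue  # overlap inside one content source is not checked here
            if covers(addr_a, addr_b) or covers(addr_b, addr_a):
                conflicts.append((name_a, addr_a, name_b, addr_b))
    return conflicts
```

For example, a plan that puts http://portal/sites/hr in one content source and http://portal/sites/hr/benefits in another would be reported as a conflict, which matches the site collection and subsite case described above.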

Crawling content

You can use a content source to manually start a crawl or schedule when and how often the selected content source is crawled. If you want to crawl content in a part of your content source on a different schedule, you must create a separate content source for that content. For performance and manageability reasons, we recommend that you use as few content sources as possible. For more information about starting a crawl manually or scheduling a crawl, see Crawl content (Office SharePoint Server 2007).

Authentication

When the crawler accesses the start addresses listed in a content source, the crawler must be authenticated by and granted access to the servers that host that content. The user account that is used by the crawler must have at least read permission to crawl content. By default, Office SharePoint Server 2007 uses the default content access account and uses NTLM when authenticating with servers. For more information, see Configure how the crawler authenticates (Office SharePoint Server 2007).