About content sources (Office SharePoint Server 2007)
Updated: April 16, 2009
Applies To: Office SharePoint Server 2007
Content is any item that can be crawled, such as a Web page, a Microsoft Office Word document, business data, or an e-mail message. Content resides in a content repository, such as a Web site, file share, or SharePoint site. A content source specifies settings that define how and on what schedule content is crawled. It includes one or more start addresses, which are the addresses in a content repository from which crawling begins. The settings of a content source apply to all of its start addresses.
Default content source
If your organization has to crawl only the content that is contained in the SharePoint sites in your server farm, you might not have to create an additional content source. Microsoft Office SharePoint Server 2007 defines a default content source during its initial deployment. The default content source is named Local Office SharePoint Server sites. The start addresses of all Web applications in the server farm are automatically included as part of the default content source. By default, this content source is not crawled. To index the content in the default content source, you have to either start a crawl manually or schedule crawls for it.
Creating a new content source
When you create a content source, you specify settings that define the kind of content it crawls, when the content is crawled, and crawling behavior, such as how deep to crawl within the namespace of the start address or how many server hops to allow. If you have multiple kinds of content repositories that you want to crawl, or you want to crawl some content repositories on different schedules, you have to create additional content sources. Office SharePoint Server 2007 can support up to 500 content sources per Shared Service Provider (SSP), and each content source can contain up to 500 start addresses. For more information about when to create additional content sources, see the “Plan content sources” section of Plan to crawl content (Office SharePoint Server). For more information about how to configure crawling behavior, see Limit or increase the quantity of content that is crawled (Office SharePoint Server).
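The limits above can be checked while a crawl plan is still on paper. The following is a minimal sketch in Python, not part of Office SharePoint Server 2007 and not a call into any SharePoint API. The `ContentSource` record and the `check_limits` function are hypothetical names introduced only for illustration; the two limits are the ones stated in this article.

```python
from dataclasses import dataclass, field

# Limits stated in this article for Office SharePoint Server 2007.
MAX_CONTENT_SOURCES_PER_SSP = 500
MAX_START_ADDRESSES_PER_SOURCE = 500


@dataclass
class ContentSource:
    """Hypothetical planning record; not a SharePoint API type."""
    name: str
    start_addresses: list = field(default_factory=list)


def check_limits(content_sources):
    """Return a list of problems; the list is empty if the plan is within limits."""
    problems = []
    if len(content_sources) > MAX_CONTENT_SOURCES_PER_SSP:
        problems.append(
            f"{len(content_sources)} content sources exceeds the limit of "
            f"{MAX_CONTENT_SOURCES_PER_SSP} per SSP"
        )
    for source in content_sources:
        count = len(source.start_addresses)
        if count > MAX_START_ADDRESSES_PER_SOURCE:
            problems.append(
                f"{source.name!r} has {count} start addresses; the limit is "
                f"{MAX_START_ADDRESSES_PER_SOURCE}"
            )
    return problems
```

A plan for one SSP would be passed in as a list of `ContentSource` records, and any returned message points to a content source that has to be split or consolidated before it is configured.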
Types of content repositories
You can crawl only one kind of content repository per content source. For example, you can create one content source that contains URLs for SharePoint sites and another that contains URLs for file shares, but you cannot create a single content source that contains URLs for both SharePoint sites and file shares.
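One way to respect this rule when planning is to record the repository kind next to each start address and then group by kind, so that each group becomes a candidate content source. The sketch below is illustrative Python only. `RepositoryKind`, `group_by_kind`, and the example.com addresses are assumptions, and the kind of each address is supplied by the planner rather than inferred from the URL.

```python
from collections import defaultdict
from enum import Enum


class RepositoryKind(Enum):
    """Hypothetical labels for repository kinds named in this article."""
    SHAREPOINT_SITES = "SharePoint sites"
    FILE_SHARES = "File shares"
    EXCHANGE_PUBLIC_FOLDERS = "Exchange public folders"
    BUSINESS_DATA = "Business Data"


def group_by_kind(planned_addresses):
    """Group (start_address, kind) pairs so that each group holds one kind only."""
    groups = defaultdict(list)
    for address, kind in planned_addresses:
        groups[kind].append(address)
    return dict(groups)


# Placeholder addresses; the kind of each one is declared, not detected.
planned = [
    ("http://intranet.example.com", RepositoryKind.SHAREPOINT_SITES),
    ("file://fileserver.example.com/projects", RepositoryKind.FILE_SHARES),
    ("http://teams.example.com", RepositoryKind.SHAREPOINT_SITES),
]

for kind, addresses in group_by_kind(planned).items():
    print(kind.value, "->", addresses)
```

Each resulting group still might be split further, for example when some addresses of the same kind need a different crawl schedule.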
The following table lists the kinds of content repositories that Office SharePoint Server 2007 can crawl:
|This kind of content source|Includes this kind of content|
|Exchange public folders| |
|Business Data (Enterprise Edition only)| |
Start address of content
Each content source contains a list of start addresses that the crawler uses to connect to the repository of content. Each content source can contain up to 500 start addresses. You cannot crawl the same address using multiple content sources. For example, if you use a particular content source to crawl a site collection and all its subsites, you cannot use a different content source to crawl one of those subsites on a different schedule.
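The rule that the same address cannot be crawled by multiple content sources can be checked before any content source is created. The following Python sketch flags start addresses in different content sources that are identical or that sit beneath one another in the URL namespace. It is an approximation under stated assumptions: the function names and example addresses are hypothetical, paths are compared case-sensitively, and whether a nested address is actually reached by a crawl depends on the crawl settings of the content source.

```python
from itertools import combinations
from urllib.parse import urlsplit


def _normalize(address):
    """Lower-case the scheme and host, and strip any trailing slash from the path."""
    parts = urlsplit(address)
    return parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/")


def is_same_or_beneath(child, parent):
    """Return True if `child` is the same address as `parent` or sits below it."""
    c_scheme, c_host, c_path = _normalize(child)
    p_scheme, p_host, p_path = _normalize(parent)
    if (c_scheme, c_host) != (p_scheme, p_host):
        return False
    return c_path == p_path or c_path.startswith(p_path + "/")


def find_overlaps(content_sources):
    """Yield (source_a, address_a, source_b, address_b) for conflicting pairs.

    `content_sources` maps a content source name to its list of start addresses.
    """
    for (name_a, addrs_a), (name_b, addrs_b) in combinations(content_sources.items(), 2):
        for a in addrs_a:
            for b in addrs_b:
                if is_same_or_beneath(a, b) or is_same_or_beneath(b, a):
                    yield name_a, a, name_b, b
```

For instance, a plan that lists `http://intranet.example.com/sites/hr` in one content source and `http://intranet.example.com/sites/hr/benefits` in another would be reported as a conflict, which mirrors the site collection and subsite example above.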
You can use a content source to manually start a crawl or schedule when and how often the selected content source is crawled. If you want to crawl content in a part of your content source on a different schedule, you must create a separate content source for that content. For performance and manageability reasons, we recommend that you use as few content sources as possible. For more information about starting a crawl manually or scheduling a crawl, see Crawl content (Office SharePoint Server 2007).
When the crawler accesses the start addresses listed in a content source, the crawler must be authenticated by and granted access to the servers that host that content. The user account that is used by the crawler must have at least read permission to crawl content. By default, Office SharePoint Server 2007 uses the default content access account and uses NTLM when authenticating with servers. For more information, see Configure how the crawler authenticates (Office SharePoint Server 2007).