About content sources (Search Server 2008)

Applies To: Microsoft Search Server 2008

 

Topic Last Modified: 2009-04-27

Note

Unless otherwise noted, the information in this article applies to both Microsoft Search Server 2008 and Microsoft Search Server 2008 Express.

Content is any item that can be crawled, such as a Web page, a Microsoft Office Word document, business data, or an e-mail message. Content resides in a content repository, such as a Web site, file share, or SharePoint site. A content source specifies settings that define how and on what schedule content is crawled. It includes one or more addresses of a content repository from which to start crawling, also called start addresses. The settings of a content source apply to every start address in that content source.

Default content source

If your organization has to crawl only the content that is contained in SharePoint sites, you might not have to create an additional content source. Search Server 2008 defines a default content source during its initial deployment. The default content source is named Local Office SharePoint Server sites. The start addresses of all Web applications in the server farm are automatically included as part of the default content source. By default, this content source is not crawled. To index the content in the default content source, you have to either start a crawl manually or schedule crawls for it.

Creating a new content source

When you create a content source, you specify settings that define the kind of content it crawls, when the content is crawled, and crawling behavior, such as how deep to crawl within the namespace of the start address or how many server hops to allow. If you have multiple kinds of content repositories that you want to crawl, or you want to crawl some content repositories on different schedules, you have to create additional content sources. Search Server has one Shared Service Provider (SSP) that supports up to 500 content sources. For more information, see the “Plan content sources” section of Plan to crawl content (Search Server 2008). For more information about how to configure crawling behavior, see Limit or increase the quantity of content that is crawled (Search Server 2008).
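The settings described above can be pictured as a small data model. The following Python sketch is illustrative only: the class names, field names, and the "manual" schedule label are hypothetical and are not part of any Search Server API. The only value taken from this article is the limit of 500 content sources per SSP.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_CONTENT_SOURCES_PER_SSP = 500  # limit stated in this article


@dataclass
class ContentSource:
    """Hypothetical model of content source settings (not a product API)."""
    name: str
    kind: str                               # for example, "Web sites"
    start_addresses: List[str]
    crawl_schedule: str = "manual"          # illustrative label only
    max_page_depth: Optional[int] = None    # None means "no limit" in this sketch
    max_server_hops: Optional[int] = None   # None means "no limit" in this sketch


@dataclass
class SharedServiceProvider:
    """Hypothetical container that enforces the per-SSP content source limit."""
    content_sources: List[ContentSource] = field(default_factory=list)

    def add(self, source: ContentSource) -> None:
        # Refuse to go past the documented limit of 500 content sources.
        if len(self.content_sources) >= MAX_CONTENT_SOURCES_PER_SSP:
            raise ValueError("SSP limit of 500 content sources reached")
        self.content_sources.append(source)
```

In this sketch, the schedule and the crawl-behavior limits are attached to the content source as a whole, which mirrors the statement that settings apply to all start addresses in a content source.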

Types of content repositories

You can crawl only one kind of content per content source. For example, you can create one content source that contains URLs for SharePoint sites and another that contains URLs for file shares, but you cannot create a single content source that contains URLs for both SharePoint sites and file shares.

The following table lists the kinds of content that Search Server can crawl and index:

This kind of content source | Includes this kind of content

SharePoint sites

  • SharePoint sites from the same farm or different Microsoft Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Search Server 2008 farms

  • SharePoint sites from Microsoft Office SharePoint Portal Server 2003 or Microsoft Windows SharePoint Services 2.0 farms

    Note

    The Search Server 2008 crawler can automatically crawl all Office SharePoint Server 2007, Windows SharePoint Services 3.0, and Search Server 2008 sites and subsites. The crawler can also crawl sites on previous versions of SharePoint Products and Technologies, but for those versions you must specify the URL of each top-level site (site collection) and each subsite that you want to crawl.
    Sites listed in the Site Directory of Microsoft Office SharePoint Portal Server 2003 farms are crawled when the portal site is crawled. For more information about the Site Directory, see About the Site Directory (https://go.microsoft.com/fwlink/?LinkId=88227).

Web sites

  • Web content within your organization not found on SharePoint sites

  • Content on Web sites on the Internet

    Note

    The crawler behaves in the same manner when it uses the Web sites content type or the SharePoint sites content type. Only the crawl settings that you can configure for these content source types differ.

File shares

  • Content on file shares within your organization

Exchange public folders

  • Microsoft Exchange Server content

Lotus Notes

  • Content stored in Lotus Notes databases

    Note

    The Lotus Notes content source option does not appear in the user interface until you have configured the index server to work with Lotus Notes. For more information, see Prepare to crawl Lotus Notes (Search Server 2008).
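Because a content source can hold only one kind of content, a mixed list of repositories has to be split into at least one content source per kind. The following Python sketch shows one way to plan that split. The function name and the sample addresses are hypothetical; the kind names are the ones listed in the table above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Kinds of content source listed in this article.
CONTENT_SOURCE_KINDS = (
    "SharePoint sites",
    "Web sites",
    "File shares",
    "Exchange public folders",
    "Lotus Notes",
)


def plan_content_sources(
    repositories: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group (address, kind) pairs into one start-address list per kind."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for address, kind in repositories:
        if kind not in CONTENT_SOURCE_KINDS:
            raise ValueError("unknown kind of content source: %r" % kind)
        plan[kind].append(address)
    return dict(plan)


# Hypothetical repositories: two SharePoint sites and one file share.
plan = plan_content_sources([
    ("http://intranet/sites/hr", "SharePoint sites"),
    ("\\\\fs01\\docs", "File shares"),
    ("http://intranet/sites/it", "SharePoint sites"),
])
```

Here `plan` contains two entries, one for each kind, so at least two content sources would be needed. Repositories of the same kind that need different crawl schedules would still require further splitting.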

Start address of content

Each content source maintains a list of start addresses that the crawler uses to connect to the repository of content. Each content source can contain up to 500 start addresses. You cannot crawl the same address using multiple content sources. For example, if you use a particular content source to crawl a site collection and all its subsites, you cannot use a different content source to crawl one of those subsites on a different schedule.
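The two constraints above (at most 500 start addresses per content source, and no address shared between content sources) can be checked before you configure anything. The following Python sketch is a planning aid under stated assumptions, not a Search Server feature. The function names are hypothetical, and the normalization (ignoring case and a trailing slash) is a simplifying assumption. The sketch detects only identical addresses; it does not detect a subsite that is already covered by a site collection in another content source.

```python
from typing import Dict, List

MAX_START_ADDRESSES = 500  # per content source, as stated in this article


def normalize(address: str) -> str:
    """Simplifying assumption: ignore case and a trailing slash."""
    return address.strip().rstrip("/").lower()


def find_problems(sources: Dict[str, List[str]]) -> List[str]:
    """Return readable messages for limit violations and shared addresses.

    `sources` maps a content source name to its list of start addresses.
    """
    problems: List[str] = []
    owner: Dict[str, str] = {}  # normalized address -> first source that uses it
    for name, addresses in sources.items():
        if len(addresses) > MAX_START_ADDRESSES:
            problems.append(
                "%s: %d start addresses exceeds the limit of %d"
                % (name, len(addresses), MAX_START_ADDRESSES)
            )
        for address in addresses:
            first = owner.setdefault(normalize(address), name)
            if first != name:
                problems.append(
                    "%s is in both %s and %s" % (address, first, name)
                )
    return problems
```

For example, calling `find_problems` with two content sources that both list the same hypothetical address, such as `http://intranet/sites/hr`, returns one message that names both content sources.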

Crawling content

For each content source, you can start a crawl manually or schedule when and how often the content source is crawled. If you want to crawl part of the content in a content source on a different schedule, you must create a separate content source for that content. However, for performance and manageability reasons, we recommend that you use as few content sources as possible. For more information about starting a crawl manually or scheduling a crawl, see Crawl content (Search Server 2008).

Authentication

When the crawler accesses the start addresses listed in a content source, the crawler must be authenticated by and granted access to the servers that host that content. The user account that is used by the crawler must have at least read permission to crawl content. By default, Search Server uses the default content access account and uses NTLM when authenticating with servers. For more information, see Configure how the crawler authenticates (Search Server 2008).

See Also

Concepts

Plan to crawl content (Search Server 2008)
Configure searches to return blog post results (Search Server 2008)
Configure client certificates for crawling an SSL site (Search Server 2008)