Plan FAST Search Server farm topology (FAST Search Server 2010 for SharePoint)
Published: May 12, 2010
This topic describes the Microsoft FAST Search Server 2010 for SharePoint farm topology, including the various components that can be scaled out on multiple servers for performance and fault-tolerance reasons.
For more information on the overall Microsoft SharePoint Server 2010 farm topology, see Plan search topology (FAST Search Server 2010 for SharePoint).
The content flow
FAST Search Server 2010 for SharePoint retrieves content for indexing using one or more of the supported indexing connectors. The FAST Content Search Service Application (SSA) is the default indexing connector and retrieves content from various content sources such as SharePoint content repositories, Web servers, Exchange folders, line of business data and file servers. You can use other FAST Search Server 2010 for SharePoint indexing connectors for more specific content retrieval scenarios. For more information about the indexing connector options, see Plan for crawling and federation (FAST Search Server 2010 for SharePoint).
Item processing extracts searchable content from retrieved documents, and processes the items based on the written language.
The Indexing component converts the searchable content into inverted indexes that are in turn used by the query matching.
Query processing processes user queries by performing query transformations, such as synonym expansion, before the actual query matching against the index.
Query matching uses the search indexes to return items that match a user query. The items are returned in a query hit list that is sorted by the relevancy to the specified query.
FAST Search Server 2010 for SharePoint interacts with Active Directory and claims infrastructure to resolve permissions and group memberships. It then only returns items the current user is allowed to see, according to the settings of the content source.
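The query transformation step in the content flow above can be illustrated with a minimal synonym expansion sketch. The synonym dictionary, the function names, and the AND/OR rendering are assumptions made for illustration; the product manages synonyms through its administration component and uses its own query language.

```python
# Hypothetical synonym dictionary (illustration only).
SYNONYMS = {
    "car": ["automobile"],
    "laptop": ["notebook"],
}


def expand_query(terms):
    """Return one group per query term: the term followed by its synonyms."""
    expanded = []
    for term in terms:
        key = term.lower()
        expanded.append([key] + SYNONYMS.get(key, []))
    return expanded


def to_query_string(groups):
    """Render the groups as an AND of ORs, purely for display."""
    parts = []
    for group in groups:
        if len(group) > 1:
            parts.append("(" + " OR ".join(group) + ")")
        else:
            parts.append(group[0])
    return " AND ".join(parts)


groups = expand_query(["Car", "rental"])
print(to_query_string(groups))
```

The expanded query is what would then be matched against the index, so an item that mentions only "automobile" can still be returned for the query "car rental".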
Components within the FAST Search Server 2010 for SharePoint farm
FAST Search Server 2010 for SharePoint can run on a single node. Alternatively, you can scale out by running one or more of the components on multiple nodes. A scaled-out system can index a larger number of items, handle more item updates, reduce indexing latency, or respond to more queries per second.
The following figure shows the main components of the FAST Search Server 2010 for SharePoint.
FAST Search for SharePoint farm topology
The following subsections describe the functionality for each component.
The item processing component receives items to be indexed from indexing connectors and processes the items according to the given configuration. It then sends the processed items to the indexing service.
Key features of the item processing service are:
Mapping from crawled properties to managed properties. Managed properties contain the content that will be indexed including metadata associated with the items.
First, you will discover crawled properties on an initial set of crawled items. Based on this, you can change the mapping to managed properties.
Parsing of document formats such as Word, Excel and PDF. This includes extracting searchable text and metadata from these formats.
Extracting properties from crawled content. Property extraction detects various properties such as names and dates, and maps them into managed properties. In this manner you can query these properties, and also apply query refinement based on these properties. Key extracted properties are company names, people names, locations, and dates.
It is also possible to create custom property extractors using, for example, a dictionary of product names relevant to your organization.
Linguistic processing of items before indexing. In search, linguistics is defined as the use of information about the structure and variation of languages so that users can more easily find relevant information. An item’s relevancy with regard to a query is not necessarily decided by the words common to both query and document, but instead by the extent to which its content satisfies the user’s need for information.
The linguistic processing includes detection of the written language and linguistic normalization of the content according to the given language. Linguistic normalization includes character normalization and normalization of stemming variations.
FAST Search Server 2010 for SharePoint enables you to customize how items are processed—for example, by specifying what kinds of properties to extract and how they can be queried.
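The mapping from crawled properties to managed properties listed above can be pictured as a simple lookup table. The following is a minimal sketch only; the property names and the function are hypothetical and do not reflect the product's actual schema or API.

```python
# Hypothetical mapping table: crawled property name -> managed property name.
CRAWLED_TO_MANAGED = {
    "doc.title": "title",
    "doc.author": "author",
}


def map_properties(crawled_item, mapping=CRAWLED_TO_MANAGED):
    """Split a crawled item into managed properties and unmapped names.

    Only mapped properties end up in the managed set, which is the part
    that would be indexed. Unmapped crawled properties are reported so an
    administrator could decide whether to add a mapping for them.
    """
    managed = {}
    unmapped = []
    for name, value in crawled_item.items():
        target = mapping.get(name)
        if target is None:
            unmapped.append(name)
        else:
            managed[target] = value
    return managed, sorted(unmapped)


managed, unmapped = map_properties(
    {"doc.title": "Plan topology", "doc.author": "Kim", "x.custom": 1}
)
print(managed)
print(unmapped)
```

This mirrors the workflow described in the list: crawled properties are discovered first, and the mapping is adjusted afterward based on what was found.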
The content distributor communicates with the indexing connectors and organizes the feeding of documents from indexing connectors to the indexing service. You can set up a primary and a backup content distributor for fault-tolerance.
You can set up multiple item processing nodes for fault-tolerance and performance. Certain item processing operations are processing-intensive and will require more than one item processing node to handle the feeding rate.
Web link analysis (Web Analyzer)
The Web Analyzer has two main functions: It analyzes search clickthrough logs and hyperlink structures. Both contribute to better ranked search results.
Items that show many clicks in the search clickthrough log are popular and therefore receive better rank scores than less-viewed items. Items that are linked to from many other items are also perceived to be more relevant for the user and therefore receive better rank scores.
The Web Analyzer improves search relevancy by analyzing the link graph and adding anchor texts and a query independent rank boost based on link cardinality to the items in the index. Anchor texts describe the items they refer to and will improve recall and relevancy when a query term matches the anchor text. Items with many links pointing to them will be ranked higher.
The Web Analyzer may scale out to many nodes to reduce the total time that is needed for the analysis. This is done by adding dedicated lookup database components and link processing components that are used during the link analysis.
The link processing component receives tasks from the Web Analyzer during link processing. Large scale installations use multiple link processors.
The lookup database component represents a key/value lookup server that retrieves the link information generated by the link processing. The item processing looks up the link information for an item using the URL as key. Large scale installations use multiple lookup database components.
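As a rough illustration of the link analysis and the URL-keyed lookup described above, the sketch below aggregates inbound link counts and anchor texts per target URL and then looks them up by URL. The data layout and function names are assumptions; the article does not describe the Web Analyzer's internal format.

```python
from collections import defaultdict


def build_link_info(links):
    """Aggregate link information per target URL.

    links: iterable of (source_url, target_url, anchor_text) tuples.
    Returns a dict keyed by target URL with the inbound link count and
    the distinct anchor texts that point at that URL.
    """
    info = defaultdict(lambda: {"inlinks": 0, "anchors": []})
    for source, target, anchor in links:
        if source == target:
            continue  # self-links are ignored in this sketch
        entry = info[target]
        entry["inlinks"] += 1
        if anchor and anchor not in entry["anchors"]:
            entry["anchors"].append(anchor)
    return dict(info)


def lookup(link_info, url):
    """Key/value lookup by URL, as item processing would do per item."""
    return link_info.get(url, {"inlinks": 0, "anchors": []})


links = [
    ("http://a", "http://b", "Pricing"),
    ("http://c", "http://b", "Price list"),
    ("http://b", "http://b", "Home"),
]
print(lookup(build_link_info(links), "http://b"))
```

In a large installation this table would be partitioned across several lookup database components; the sketch keeps it in a single dictionary for clarity.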
The search cluster provides the main topology for indexing and query matching. These components require their own scaling models using a matrix of servers in a row/column configuration. The following figure shows the key concepts used in a search cluster topology. The figure illustrates a deployment with four rows: two pure indexer rows and two pure search rows. In real deployments, it is common to combine indexer and search rows in order to reduce hardware footprint.
FAST Search Cluster Architecture
Index column: The complete searchable index can be split into multiple disjoint index columns when the complete index is too large to reside on one server. A query will be evaluated against all index columns within the search cluster, and the results from each index column are merged into the final query hit list.
Search row: A set of search nodes that contain all items indexed within the search cluster. A search row consists of one search node for each index column within the search cluster. You use multiple search rows to provide performance load sharing and fault-tolerance. Search rows can be combined with both types of indexer rows (see below).
Primary and backup indexer: You can configure a backup indexer node for fault tolerance. The two indexers produce the same set of indexes, but only the primary indexer distributes the indexes to the search rows. Indexer rows cannot be combined on the same indexing nodes.
The primary and backup indexer nodes are specified as indexer rows in the deployment configuration file (deployment.xml). Search rows and indexer rows use the same row numbering in the deployment configuration file.
In the deployment configuration file (deployment.xml), a search row corresponds to searchcluster/row/search="true". The indexer role of a row is set in searchcluster/row/index; a row with the role none does no indexing. An indexer row can also be a search row.
searchcluster/row/index="none" is a pure search row.
searchcluster/row/search="false" is a pure indexer row.
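To make the row roles concrete, the following sketch classifies rows from an illustrative XML fragment. The fragment is not a complete or validated deployment.xml. The id attribute and the index values "primary" and "secondary" are assumptions; the text above confirms only index="none" and search="true" or "false".

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only. "primary", "secondary", and the id attribute
# are assumed for this example.
FRAGMENT = """
<searchcluster>
  <row id="0" index="primary" search="true" />
  <row id="1" index="secondary" search="false" />
  <row id="2" index="none" search="true" />
</searchcluster>
"""


def classify_rows(xml_text):
    """Map each row id to a description of the roles it carries."""
    kinds = {}
    for row in ET.fromstring(xml_text).iter("row"):
        indexes = row.get("index", "none") != "none"
        searches = row.get("search", "false") == "true"
        if indexes and searches:
            kind = "combined indexer and search row"
        elif indexes:
            kind = "pure indexer row"
        elif searches:
            kind = "pure search row"
        else:
            kind = "row with no role"
        kinds[row.get("id")] = kind
    return kinds


for row_id, kind in classify_rows(FRAGMENT).items():
    print(row_id, kind)
```

The classification follows the rules stated above: a row is a search row when search is "true", and it is an indexer row unless index is "none".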
The indexing component creates inverted indexes, based on the items that it receives. The indexing component sends these inverted indexes to the query matching component for later use during query evaluation.
The indexing service consists of two components, the indexing dispatcher component and the indexing component. If the indexing service is deployed on multiple nodes, instances of these components will also be deployed on multiple nodes.
If you have more than one index column, you must combine the indexes to yield consistent search results. In this case, you have to deploy one indexing node for each index column. The indexing dispatcher manages the routing of processed items to the correct column.
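The routing role of the indexing dispatcher can be sketched as a deterministic function from an item identifier to a column number. The article does not state which routing rule the product uses, so the hash-based rule below (CRC32 modulo the column count) is purely an assumption for illustration.

```python
import zlib


def column_for(item_id, num_columns):
    """Pick an index column for an item (assumed hash-based routing)."""
    if num_columns < 1:
        raise ValueError("num_columns must be at least 1")
    return zlib.crc32(item_id.encode("utf-8")) % num_columns


def dispatch(item_ids, num_columns):
    """Group item ids by the column that would index them."""
    columns = [[] for _ in range(num_columns)]
    for item_id in item_ids:
        columns[column_for(item_id, num_columns)].append(item_id)
    return columns


for number, items in enumerate(dispatch(["doc-1", "doc-2", "doc-3", "doc-4"], 2)):
    print("column", number, items)
```

Whatever the actual rule is, the important property is the one shown here: every item lands in exactly one column, and an update to the same item is routed to the same column again.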
The indexing service scales out according to the number of items. If the indexing service runs on a single node, both the number of items it can handle per second and the total number of items it can include in the index are limited. To scale out the indexing service, you can deploy it across more than one index column. Each index column will contain one part of the index, and the combined set of index columns will form the complete index. In this case, each indexing node will handle only a part of the whole index, so the service scales out both the number of items that can be indexed per second and the total number of items. Additionally, backup indexing nodes can provide fault tolerance.
It is also possible to use multiple indexing dispatchers for both fault tolerance and performance reasons. You normally deploy the indexing dispatcher to the same node as the primary indexing node.
The query matching service uses the inverted indexes created by the indexing service to retrieve the items that match a query and then return these items as a query hit list. A query usually contains several terms combined with query operators, such as AND and OR. The query matching service looks up each term in the index and retrieves a list of items in which that term appears. In the case of an AND operator, for example, the query hit list will consist of the set of items that contain all the terms. The order of the returned items is based on the requested sorting mechanism, which is usually a complex ranking that is calculated from various item properties or a sort based on one or more of the item properties.
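The term lookup and the AND/OR semantics described above can be shown with a toy inverted index. This is a minimal sketch with whitespace tokenization and no ranking; it is not the product's index format.

```python
from collections import defaultdict


def build_index(items):
    """Build a term -> set of item ids mapping from {item_id: text}."""
    index = defaultdict(set)
    for item_id, text in items.items():
        for term in text.lower().split():
            index[term].add(item_id)
    return index


def match_and(index, terms):
    """Items that contain all the terms (intersection of posting sets)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def match_or(index, terms):
    """Items that contain at least one of the terms (union of posting sets)."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))


items = {
    1: "fast search server",
    2: "search topology",
    3: "server farm",
}
index = build_index(items)
print(match_and(index, ["search", "server"]))
print(match_or(index, ["search", "server"]))
```

A real query hit list would additionally be ordered by a rank score or by a sort property, which this sketch leaves out.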
The query matching service can also return a hit highlighted summary for each item in the query hit list. A hit highlighted summary consists of a fragment of the original item in which the matching query terms are highlighted.
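A hit highlighted summary can be approximated by cutting a fragment around the first matching term and marking every match inside it. The window size, the marker string, and the function name below are assumptions for illustration, and the fragment boundaries are not aligned to word boundaries.

```python
import re


def highlight(text, terms, width=40, marker="**"):
    """Return a fragment of text with matching query terms marked."""
    if not terms:
        return text[:width]
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    first = pattern.search(text)
    if first is None:
        return text[:width]  # no match: fall back to the start of the item
    start = max(0, first.start() - width // 2)
    fragment = text[start:start + width]
    return pattern.sub(lambda m: marker + m.group(0) + marker, fragment)


print(highlight("Search rows scale query volume", ["QUERY"], width=100))
```

Matching is case-insensitive, and the original casing of the item text is preserved in the summary.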
The query matching service is responsible for the deep refinement that is associated with query results. Query refinement enables drilling down into a query result by using aggregated statistical data that was computed for the query result. The query matching service maintains aggregation data structures to enable deep refinement across large result sets.
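The aggregation behind query refinement can be sketched as counting property values over the full set of matching items and then narrowing the result by a chosen value. The property names and functions are hypothetical; the product maintains dedicated aggregation structures rather than scanning hits as done here.

```python
from collections import Counter


def refiners(hits, properties):
    """Count the values of each property across all hits in the result."""
    counts = {prop: Counter() for prop in properties}
    for hit in hits:
        for prop in properties:
            value = hit.get(prop)
            if value is not None:
                counts[prop][value] += 1
    return counts


def refine(hits, prop, value):
    """Drill down: keep only the hits whose property equals the value."""
    return [hit for hit in hits if hit.get(prop) == value]


hits = [
    {"id": 1, "format": "pdf", "author": "Kim"},
    {"id": 2, "format": "docx", "author": "Kim"},
    {"id": 3, "format": "pdf"},
]
print(refiners(hits, ["format", "author"]))
print(refine(hits, "format", "pdf"))
```

The counts are computed over every hit, not just the first page of results, which is what lets a user see how many items each refinement would leave.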
You can deploy the query matching service in a row/column setup to achieve fault-tolerance and scaling in content and query volume. Index columns provide ways to scale out for content volume, by partitioning the overall index into a set of disjoint columns. Search rows provide ways to scale out for query volume, by duplicating the same partition of the index across more than one query matching node.
The number of columns in the query matching service always equals the number of columns in the indexer service. The reason is that the index columns represent a partitioning of the index, and each query matching node can handle only one such partition of the index.
Search rows and indexer rows scale out independently. A search row duplicates another search row to provide fault tolerance and an increased capacity for queries. An indexer row serves as a backup mechanism for fault tolerance purposes during indexing.
The query processing component performs pre-processing of queries and post-processing of results. Query processing includes query language parsing, linguistic processing, and item-level security processing. Result processing includes merging the results from multiple index columns, formatting the query hit list, formatting the query refinement data, and removing duplicates.
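The result processing step of merging per-column results and removing duplicates can be sketched as a k-way merge of hit lists that are each sorted by descending rank. The hit tuple layout and the use of a content signature to detect duplicates are assumptions for illustration.

```python
import heapq


def merge_hits(column_results, limit=10):
    """Merge per-column hit lists into one ranked, de-duplicated list.

    column_results: list of lists of (rank, item_id, signature) tuples,
    each list already sorted by descending rank. Hits that share a
    signature are treated as duplicates; the best-ranked one is kept.
    """
    merged = heapq.merge(*column_results, key=lambda hit: -hit[0])
    seen = set()
    result = []
    for rank, item_id, signature in merged:
        if signature in seen:
            continue
        seen.add(signature)
        result.append((rank, item_id, signature))
        if len(result) == limit:
            break
    return result


column_0 = [(0.9, "a", "s1"), (0.4, "b", "s2")]
column_1 = [(0.7, "c", "s1"), (0.5, "d", "s3")]
print(merge_hits([column_0, column_1]))
```

Because each column returns its hits already sorted, the merge only needs to compare the current head of each list, which keeps the cost low even with many columns.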
The query processing component interacts with the FAST Search Authorization (FSA) component to make sure that the user performing a query sees only the results that he or she is authorized to see. The query processing service therefore validates the user’s permissions and rewrites the incoming query with an access filter that corresponds to the current user and group membership.
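The query rewrite described above can be pictured as attaching the set of principals (the user plus the user's groups) to the query, so that matching only returns items whose access control list contains at least one of them. The data structures and names below are hypothetical and do not represent the FSA component's actual interfaces or filter syntax.

```python
def principals_for(user, groups):
    """The identities under which the user may read content."""
    return {user} | set(groups)


def rewrite_with_access_filter(query_terms, user, groups):
    """Attach an access filter for the current user to the query."""
    return {
        "terms": [term.lower() for term in query_terms],
        "allowed_principals": principals_for(user, groups),
    }


def evaluate(secured_query, items):
    """Return ids of items that match all terms and pass the filter."""
    hits = []
    for item in items:
        words = set(item["text"].lower().split())
        if not all(term in words for term in secured_query["terms"]):
            continue
        if secured_query["allowed_principals"] & set(item["acl"]):
            hits.append(item["id"])
    return hits


items = [
    {"id": 1, "text": "budget plan", "acl": ["finance"]},
    {"id": 2, "text": "budget summary", "acl": ["everyone"]},
]
query = rewrite_with_access_filter(["budget"], "alice", ["everyone"])
print(evaluate(query, items))
```

Filtering inside the query, rather than trimming results afterward, means that hit counts and refinement data are computed only over items the user is allowed to see.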
The query processing service can be scaled out across multiple nodes to handle fault-tolerance and more queries per second. In this case, all the nodes have to be set up in the same manner.
The SharePoint Server 2010 Central Administration and site collection user interfaces provide the administrative interfaces for managing the FAST Search Server 2010 for SharePoint deployment and features. Common system administration services include UI and cmdlet based system and feature configuration, logging, index schema administration and search authorization.
Certain administrative operations can only be performed by using Windows PowerShell cmdlets or by using command line tools.
The administration component contains functionality to control the search experience, such as determining how to perform property extraction, ascertaining which synonyms to use, and determining which items to use as best bets.
The FSA manager is a part of the administration service that manages user authorization for indexed content. This ensures that only items that a user is entitled to read appear in the query results. The FSA manager communicates with Claims services, Active Directory services or other LDAP based directory services to manage the authorization process.
Index Schema administration
A key part of the administration service is the index schema administration. The index schema contains all the configuration entities that are needed to generate the configuration files that are related to the index schema for all the other services in the system.
The index schema controls which managed properties of an item will be indexed, how the properties will be indexed and which properties can be returned in the query hit list.
The rank profile is a part of the index schema that controls how the query hit list will be sorted by relevancy. You can configure the relevance calculation using a set of rank profile parameters.
FAST Search Authorization (FSA)
The FAST Search Authorization (FSA) manager is a part of the administration service that manages user authorization for indexed content.
The FSA manager grants users access to indexed items based on the user’s read permissions on the content source repositories. This ensures that only items that a user is entitled to read appear in the query results.
The FSA manager communicates with Active Directory services or other LDAP based directory services to manage the authorization process.
FAST Search Web Crawler
The FAST Search Web crawler is an optional indexing connector that can be used for complex Web crawl scenarios involving a mix of Internet and Intranet sites.
You can find more information about the differences between crawling Web sites with the FAST Search Content SSA and with the FAST Search Web crawler in Plan for crawling and federation (FAST Search Server 2010 for SharePoint).
The FAST Search Web crawler reads Web pages and follows links on the pages to process a complete Web of items. It then passes the retrieved items to the item processing service.
For further architecture details on the FAST Search Web crawler, see Crawling Web content with the FAST Search Web crawler.
Concepts
Plan search topology (FAST Search Server 2010 for SharePoint)
Plan for crawling and federation (FAST Search Server 2010 for SharePoint)
Recommendations: Redundancy and availability (FAST Search Server 2010 for SharePoint)
Performance and capacity planning (FAST Search Server 2010 for SharePoint)
The index schema and its features (FAST Search Server 2010 for SharePoint)
Performance and capacity test results (FAST Search Server 2010 for SharePoint)