Chapter 26 - Performance Tuning Analysis Services

Microsoft® SQL Server™ 2000 Analysis Services provides a complete server solution for online analytical processing (OLAP) and data mining business needs, while simplifying the typically complex tasks involved with the design and maintenance of OLAP servers. One of the benefits of Analysis Services is the ability to tune the performance of the Analysis server to meet specific business requirements. This chapter discusses tuning for improved overall, querying, and processing performance, and includes detailed information about the Analysis Services architecture. For information about tuning the performance of the SQL Server 2000 relational database that manages data used by Analysis Services, see Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing."

Introduction


OLAP products are becoming an essential part of decision support systems because they present large amounts of information flexibly, quickly, and in an easily accessible form. Analysis Services allows the high resource demand of such information extraction to be managed at the server level, providing an additional presentation layer while simultaneously decreasing the ad hoc query load on data warehouses.

This chapter assumes the reader has a basic knowledge of SQL Server 2000 Analysis Services and a familiarity with basic data warehousing concepts.

Why Use OLAP?

For querying information in a multidimensional fashion, Analysis Services offers two very important advantages over relational databases: flexibility and speed.

Analysis Services provides MDX, a robust query language specifically designed for working with multidimensional information. Typically, data warehouses are implemented in a relational database system, such as SQL Server 2000, designed to support two-dimensional queries using Structured Query Language (SQL) as the query language of choice.

Consider, for example, the following two queries. The first query, written in SQL, retrieves the average economic income per household, grouped by product, from a relational database:

CREATE TABLE 
#qry1_temp1 
(product_id INT, households INT, total_ei MONEY)
INSERT INTO
#qry1_temp1
(product_id, households, total_ei)
SELECT
product_id, COUNT(DISTINCT household_id), SUM(economic_income)
FROM
VLDBMart.dbo.Account_prof_fact
GROUP BY
product_id
SELECT
a.prod_name, b.total_ei / b.households
FROM
VLDBMart.dbo.ProductDim a, #qry1_temp1 b
WHERE
a.product_id = b.product_id

By contrast, this query is written in MDX and uses a multidimensional database to retrieve the same information:

WITH MEMBER
[Measures].[Average Economic Income]
AS
'SUM({[Measures].[Economic Income]}) / ([Measures].[Distinct Household Count])'
SELECT
{[Measures].[Average Economic Income]} 
ON COLUMNS,
{[ProductDim].[Product ID].MEMBERS}
ON ROWS
FROM
AccountProfitabilityH0

The first and most apparent difference between these two queries is the amount and complexity of the code needed to express them. While the SQL query needs temporary table storage to construct a single derived attribute and multiple database actions to retrieve aggregated results, the MDX query constructs the same attribute in memory as a calculated member and retrieves the same results with a single database action.

The other difference between these two queries is processing speed. Both queries were executed on the same hardware platform. The first query took approximately 88 seconds to complete; the second query took only four seconds. In other words, the first query took 22 times as long as the second. Because Analysis Services stores and manages aggregations in a multidimensional structure, access to aggregated data is considerably faster.

In a data warehousing solution, available resources are typically at a premium; such resources should be allocated as little and as late as possible, and should be released as early as possible. Analysis Services allows the full range of data warehousing information to be made available in an aggregated form while minimizing the data warehousing resources typically required to support such availability.

As impressive as this is, Analysis Services can supply even better querying and processing performance. With proper adjustment and design consideration, querying and processing performance can be improved through the judicious application of techniques discussed in this chapter.

Architecture


The performance tuning of Analysis Services can be a complex series of tasks, involving many elements within both server and client components. An overview of Analysis Services architecture clarifies how the server and client components interact with each other and helps in identifying bottlenecks in querying and processing performance.

Overview

The following diagram details the architecture of Analysis Services in terms of data and meta data access.

[Diagram: Analysis Services architecture, showing data and meta data access among client applications, PivotTable Service, and the Analysis server]

Administration of Analysis Services is performed with Analysis Manager, a user interface for the Decision Support Objects (DSO) library. The DSO library consists of a robust object model constructed from COM interfaces, easily accessible by any programming language capable of handling early or late bound COM objects. This includes languages such as Microsoft Visual Basic®, Visual C++®, Microsoft Visual Basic Scripting Edition (VBScript), and Microsoft JScript®.

Access to Analysis Services data is supported through PivotTable® Service, an OLE DB provider that implements the OLE DB for OLAP specification. PivotTable Service is not directly accessed by client applications; other data access technologies, such as OLE DB, ActiveX® Data Objects (ADO), and ActiveX Data Objects Multidimensional (ADO MD), use PivotTable Service as a data provider. PivotTable Service also provides a mechanism for disconnected usage. Portions of cubes defined and accessed from an Analysis server can be saved on the client for later use when disconnected from the network. Also, users can create cubes and OLAP and data mining models locally, accessing information from data sources through OLE DB providers.
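
For example, a client application might use ADO MD over PivotTable Service to run an MDX query. The following VBScript sketch is illustrative only; it assumes a local Analysis server hosting the FoodMart 2000 sample database and its Sales cube:

' Illustrative only: connect through PivotTable Service (MSOLAP provider)
' and execute an MDX query by using ADO and ADO MD.
Dim cn, cs
Set cn = CreateObject("ADODB.Connection")
cn.Open "Provider=MSOLAP;Data Source=localhost;Initial Catalog=FoodMart 2000"
Set cs = CreateObject("ADOMD.Cellset")
cs.Open "SELECT {[Measures].MEMBERS} ON COLUMNS FROM Sales", cn
WScript.Echo cs.Axes(0).Positions.Count & " column(s) returned"
cs.Close
cn.Close

PivotTable Service parses the MDX, answers what it can from its client cache, and forwards the remaining query requests to the Analysis server, as described later in this chapter.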

Internet and intranet access to Analysis Services data is also supported, through the use of a special Active Server Page (ASP), Msolap.asp, and companion library, Msmdpump.dll, which can be used through Microsoft Internet Information Services (IIS) to supply secure data transfer between a client application using PivotTable Service and a remote Analysis server. A variety of deployment scenarios is also possible for access to analysis data using Web pages hosted on IIS. In these scenarios, PivotTable Service runs on the IIS server, effectively making the IIS server a client to Analysis Services.

Analysis Services can access relational data from any OLE DB compliant data source, including flat files. However, real-time OLAP is supported only by SQL Server 2000 relational databases.

Memory Management

Memory management in Analysis Services is conducted both on the server and on the client. The Analysis server directly manages server-side caching, while the PivotTable Service provider manages caching on the client. Both server-side and client-side caching techniques are discussed in this section, including the economic model used to manage server-side processing memory.

Server Memory Management

Analysis Services uses a sophisticated form of memory management on the server, effectively balancing memory conservation against the processing and querying load. Several memory cache areas are allocated for specific purposes on the Analysis server, with an emphasis on separating processing and querying operations, in terms of memory.

Memory usage for an Analysis server is broken out into the following areas:

  • Read-ahead buffer 

    The read-ahead buffer is used by Analysis Services when retrieving data. 

  • Dimension memory 

    Used to cache dimension members and member properties, dimension memory is allocated at startup and retained at all times, subject to dynamic size changes when processing dimensions. With the exception of very large or huge dimensions, all dimension members are loaded into memory when the MSSQLServerOLAPService service is started or a new dimension is added, and occupy part of the virtual address space of the process.

    The total size of data for all members and member properties in all dimensions is used to calculate the space required to support dimensions in memory. A guideline for estimating this value is approximately 125 bytes per member, plus the size of data for member properties (a worked sizing sketch appears after this list). 

    A special tool, the Very Large Dimension Manager (VLDM), is used to support very large and huge dimensions. A dimension whose size exceeds the VLDM Threshold will be stored in a separate process address space. For large dimensions this frees virtual address space in the main Analysis Services process for other uses, at a small cost in the speed of accessing dimension members. This is handled transparently; no administrative action is necessary to use the VLDM. 

    ROLAP dimensions can be used if a dimension is so large (approximately 20 million members or more) that the members cannot be stored in the address space of a process. In this case, the members are not read into memory; they are retrieved as needed from the relational data source. However, a ROLAP dimension may be used only in a ROLAP cube, and this type of storage has a substantial performance impact. ROLAP dimensions are recommended only when dimensions are so large that they cannot be handled in any other way.

  • Shadow dimension memory 

    Before processing a dimension, a copy (shadow dimension) is made of the original dimension contents to allow users to continue to access cubes that contain the dimension while the dimension is being updated. Queries are directed to the shadow dimension during processing. When processing is completed, queries are redirected to the updated dimension and the shadow dimension memory is deallocated. 

  • Process buffer 

    The process buffer is used to store temporary indexing and aggregation data while processing cubes and dimensions on the Analysis server. Process buffers are used during explicitly initiated cube processing and also during lazy processing as a result of alterations to a changing dimension. One process buffer is allocated for each partition being processed at any given time. Because of the potentially high usage of process buffers, an economic model strategy, discussed later in this chapter, is used to handle allocation of process buffer memory. The process buffer is not used during querying operations. 

    If the set of aggregations computed for a partition can be fully contained in the process buffer, aggregations can be computed in memory without access to disk. If there is insufficient memory to contain the full set of aggregations, aggregations must be partially computed, stored to disk, and re-read to merge with new partial aggregations. Because of the disk I/O and merging operations, this repeated read-merge-write cycle slows processing considerably.

  • Query results cache 

    The query results cache is used to cache cell data for cubes and partitions queried by client applications, and is not used during processing operations. Cube data cached within the query results cache is versioned to maintain synchronization with client applications. An expiration scheme is used to dispose of stale data. 

    The query results cache is allocated dynamically, and is allowed to fill the available space remaining after subtracting the size of all other memory uses from the value of the HighMemoryLimit registry setting. 
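
As a worked illustration of the dimension memory guideline above, the following VBScript sketch estimates memory for a hypothetical dimension of five million members, assuming an average of 40 bytes of member property data per member (both figures are hypothetical):

' Rough sizing sketch; the member count and property size are hypothetical.
Dim members, propBytes, totalBytes
members = 5000000      ' members in the dimension
propBytes = 40         ' assumed average member property data per member
totalBytes = members * (125 + propBytes)
WScript.Echo "Approximately " & FormatNumber(totalBytes / 1048576, 0) & _
    " MB of dimension memory required"

At roughly 165 bytes per member, such a dimension would occupy on the order of 800 MB of the process virtual address space, which helps explain why very large dimensions are better served by the VLDM or by ROLAP storage.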

Server memory is allocated in the following order:

  1. Read-ahead buffer 

  2. Dimension memory 

  3. Shadow dimension memory (when processing dimensions) 

  4. Process buffer (when processing) 

  5. Query results cache 

Economic Model Management

To provide a consistent memory management structure, Analysis Services uses an economic model to manage the allocation of process buffer memory to memory-intensive processing operations.

Borrowing from the basic set of rules that govern supply and demand, process buffer memory is bought and sold in 64K segments by processing operations. The Analysis Services memory manager establishes the value of memory and the amount of credit to be extended to processing operations and, in effect, serves as a broker for process buffer memory.

The following process steps are used when an operation request needs process buffer memory:

  1. When the MSSQLServerOLAPService service is started, the segments of memory represented by the process buffer are given a purchase price, set at a very low value. 

  2. Each operation request that would require an allocation of process buffer memory registers with the Analysis Services memory manager. 

  3. The Analysis Services memory manager lends the operation request an amount of credit based on the type and potential complexity of the operation request. 

  4. The operation request first attempts to purchase memory, providing its minimum and maximum memory requirements and the amount of credit available to it. Most operation requests cannot establish a maximum memory requirement, and will request additional memory as the operation request requires it. 

  5. Based on the amount of credit offered, the Analysis Services memory manager allocates process buffer memory for the request. If the minimum memory requirement cannot be satisfied with the amount of credit available to the operation request, the needed amount is still allocated. 

  6. Because the number of available segments drops after this allocation, the purchase price for memory correspondingly rises. 

  7. Operation requests, during execution, continue to keep this memory by paying rent on it. As each operation request completes an amount of work, the Analysis Services memory manager pays it an amount of credit, which can be used to pay rent and purchase additional memory. If the operation request cannot afford to pay rent on its allocated memory, the Analysis Services memory manager revokes allocated memory; however, the amount of memory allocated to the operation request will not drop below its minimum required amount.

  8. If the operation request needs additional memory, it can pay an additional amount of credit to purchase more memory from the Analysis Services memory manager. As memory is allocated, and the cost for memory rises, this mechanism keeps the allocation of memory for large operation requests under control by making it prohibitively expensive to purchase inordinate amounts of memory. 

  9. As the operation request releases segments of memory no longer required, the memory is placed back into the pool of available segments. This deallocation of memory has two effects on the economic model. The first effect is to reduce the purchase price of memory, as demand has decreased and supply has increased. The second effect is to reduce the amount of rent required from existing operation requests, making more credit available to each existing operation request to purchase more memory as needed. 

The Analysis Services memory manager supports two basic types of operation requests: dynamic and static. Dynamic operation requests require memory to perform a series of operations, and then release memory as each operation is processed. Static operation requests, by contrast, require memory and remain resident. Dynamic requests earn credit by completing their series of operations, while static operation requests earn credit by providing information to other requests.

As with any economic model, situations arise in which there is too much demand without enough supply. In Analysis Services, this occurs primarily when too many operation requests are initiated; since all operation requests are allocated their minimum required amounts of memory, memory becomes prohibitively expensive.

The Analysis Services memory manager can take a number of steps to relieve such an "economic crisis" situation. The Analysis Services memory manager can terminate any operation request as needed, and can revoke memory as required. Processing operations can easily exhaust both physical and virtual memory, so additional steps, discussed later in this chapter, are taken to conserve memory.

Server Settings

The following registry entries are used to control various aspects of server cache management. All of the following registry values are located under HKEY_LOCAL_MACHINE\Software\Microsoft\OLAP Server\CurrentVersion. 

  • HighMemoryLimit 

    The maximum size, in bytes, of the Analysis Services process virtual memory space. If allocated memory meets or exceeds this value, memory conservation techniques are initiated. Default: total amount of physical memory. 

  • HugeLevelThreshold 

    The minimum number of members used to indicate a large level. Large levels are loaded incrementally, to conserve dimension memory. Default: 0x000003e8 (1000). 

  • LowMemoryLimit 

    The minimum size, in bytes, of process memory space. Default: half of physical memory. 

  • ProcessReadAheadSize 

    The size, in bytes, of the read-ahead buffer. Default: 0x00400000 (4194304). 

  • ProcessReadSegmentSize 

    The size, in bytes, of the process buffer. Default: 0x00400000 (4194304). 

  • ProcessRecordsReportGranularity 

    The number of records read at one time into the process buffer. Also specifies the granularity of processing progress updates for the user interface. Default: 0x000003e8 (1000). 

  • TempDirectory 

    The directory path used to store temporary files during processing operations. Default: the \Microsoft Analysis Services\Data subdirectory. 

Certain settings can be changed directly through Analysis Manager, using the Properties dialog box. For more information about how to access the Properties dialog box, see SQL Server Books Online.

The following list shows the tab and option in the Properties dialog box, and the corresponding registry value, for each setting that can be changed.

  • Environment tab, Minimum allocated memory: LowMemoryLimit 

  • Environment tab, Memory conservation threshold: HighMemoryLimit 

  • Processing tab, Read-ahead buffer size: ProcessReadAheadSize 

  • Processing tab, Process buffer size: ProcessReadSegmentSize 
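
The same settings can also be inspected from a script. The following VBScript sketch reads two of the values described above; it assumes the values are present in the registry and that the script runs with sufficient permissions on the Analysis server:

' Illustrative only; RegRead raises an error if a value does not exist.
Dim sh, base
Set sh = CreateObject("WScript.Shell")
base = "HKLM\Software\Microsoft\OLAP Server\CurrentVersion\"
WScript.Echo "HighMemoryLimit: " & sh.RegRead(base & "HighMemoryLimit")
WScript.Echo "ProcessReadSegmentSize: " & sh.RegRead(base & "ProcessReadSegmentSize")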

Client Cache Management

PivotTable Service also provides caching services on the client; the client cache roughly corresponds to the query results cache maintained by the server. The client cache uses a background thread to manage data, with a version-numbering algorithm to synchronize data on the client with data on the server. Whenever PivotTable Service discovers that the client cache version number differs from the server cache version number, PivotTable Service discards the data stored in the client cache. The client cache data is also discarded when the current database is changed.

When PivotTable Service receives a query, it parses the query and may break it up into several query requests. Each query request is first checked against the client cache; if it can be answered with data from the client cache, and the data is recent, no round trip to the server is required for that particular query request.

Client Settings

The connection string used to connect the PivotTable Service provider to an Analysis server is also used to pass values for various properties to the provider. The following properties are used specifically to control the behavior of the client-side cache.

  • Client Cache Size 

    The maximum amount of memory, expressed as a percentage or in kilobytes, that can be allocated to the cache. If the value of this property is 99 or less, the value is interpreted as a percentage of total available (physical and virtual) memory. If the value of this property is 100 or more, the value is interpreted as a number of kilobytes of total available memory. Default: 25. 
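
The dual interpretation of this property can be seen in two hypothetical connection strings; the server and catalog names are assumptions:

' Illustrative only.
Dim cn
Set cn = CreateObject("ADODB.Connection")
' A value of 99 or less is a percentage: allow the cache half of available memory.
cn.Open "Provider=MSOLAP;Data Source=localhost;" & _
    "Initial Catalog=FoodMart 2000;Client Cache Size=50"
cn.Close
' A value of 100 or more is in kilobytes: cap the cache at 65,536 KB (64 MB).
cn.Open "Provider=MSOLAP;Data Source=localhost;" & _
    "Initial Catalog=FoodMart 2000;Client Cache Size=65536"
cn.Close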

Thread Management

Analysis Services is designed to handle complex multiuser, multiprocess interaction. Both server and client interactions involve multiple thread usage, coordinated to ensure optimized data transfer when used in concert with server and client cache management techniques.

Server Thread Management

The Analysis server uses the following five types of threads to support server operations:

  • Listener thread 

    The listener thread waits for new connection requests and is responsible for creating server connections as needed. The listener thread also manages the worker thread pool, assigning worker threads to requests, initiating worker threads if there are not enough active worker threads in the pool, and ending idle worker threads as needed. 

    The Analysis server uses the BackgroundInterval registry setting to determine how often the listener thread should perform these tasks. 

  • Worker threads 

    Worker threads serve as the first line of response for requests. Worker threads are responsible for checking the server cache for data pertinent to a request, and for allocating process threads to retrieve data from storage. 

  • Process threads 

    Process threads retrieve data from storage, asynchronously querying and retrieving stored data in parallel. Different process threads can concurrently retrieve data from different partitions. The size of this thread pool determines how much concurrent activity is being performed in the system. 

  • Log thread 

    The log thread logs query requests to a query log database maintained by the Analysis server. Logged queries can then be used for usage pattern analysis, which is crucial to the effective design of aggregations and partitions. 

  • Cleaner thread 

    The results obtained by process threads are stored in the query result cache for later retrieval. Because data that is cached for queries, but infrequently accessed by other queries, can accumulate, the cleaner thread iteratively removes the least recently used cache data. Usually, this thread is set to a below-normal priority, to prevent resource drain from active tasks. However, if allocated cache memory meets or exceeds the halfway point (the midpoint between the LowMemoryLimit and HighMemoryLimit registry settings), the cleaner thread flushes the cache whenever possible. 

    If the allocated cache memory meets or exceeds the HighMemoryLimit registry setting, the priority of the cleaner thread is set to normal and the cache is cleaned again. If this step does not reduce allocated cache memory back to the halfway point, the priority of the cleaner thread is set to above normal and the server cache is cleaned again. 

    The Analysis server uses the BackgroundInterval registry setting to determine how often the cleaner thread should perform these tasks.

Worker and Process Thread Interaction

The following diagram shows how this complex interaction is supported through the efficient management of threads.

[Diagram: interaction of the listener, worker, and process threads when resolving a request (steps 1 through 9 below)]

The numbers in the previous diagram match the steps needed to resolve a request, as outlined below:

  1. All requests, whether initiated by Analysis Services or by a client application, are submitted to the Analysis server.

  2. The listener thread, which manages the worker thread pool, matches idle worker threads to requests. If there are not enough idle worker threads in the pool to handle requests, the listener thread starts up additional worker threads. If the pool is at the maximum number of worker threads, and no worker thread is idle, a worker thread cannot be allocated and the listener thread sends an error to the request. 

  3. Each worker thread then evaluates the request against the server cache. If some or all of the data slices needed to fulfill a request reside in the server cache, the worker thread directly retrieves and fulfills as much of the request as possible from the server cache. 

  4. If the worker thread cannot complete the request from the server cache, it requests the needed number of process threads from the process thread pool. 

  5. If a worker thread cannot allocate enough process threads to support its requirements, additional process thread requests will be queued until process threads are available. 

  6. If query logging is enabled, the log thread records every nth query request, determined by the value of the QueryLogSampling registry setting, in a database for later reference. 

  7. As each process thread returns its result, the worker thread stores the information. When the last process thread concludes its process, the worker thread then returns the results of the request to the initiating application. Because the worker thread cannot return until all of its process requests have been concluded, the worker thread waits until all queued requests have been assigned to and processed by an available process thread. Queued processes are given priority over other incoming process requests until the process queue is empty. 

  8. If a worker thread is idle for a specified period of time, the listener thread can terminate it to conserve resources. 

  9. After a certain period of time or in memory conservation situations, the cleaner thread cleans unused and expired data from the server cache. 

Server Settings

The following list describes the registry settings used to control thread management in Analysis Services. All of the settings reside in the registry under HKEY_LOCAL_MACHINE\Software\Microsoft\OLAP Server\CurrentVersion. 

  • BackgroundInterval 

    The number of seconds between background processing periods. Default: 0x0000001e (30). 

  • InitWorkerThreads 

    The number of worker threads initially started by the listener thread. Default: 0x00000004 (4). 

  • PoolProcessThreads 

    The maximum number of process threads maintained by the process thread pool. Default: 0x00000028 (40). 

  • PoolWorkerThreads 

    The maximum number of worker threads maintained by the worker thread pool. Default: 0x0000001e (30). 

  • ProcessThreads 

    The number of process threads that can simultaneously use the processors available on the Analysis server. Default: 2 per processor. 

  • QueryLogConnectionString 

    The connection string used by the log thread to connect to the query log database. Default: varies. 

  • QueryLogSampling 

    The sampling interval for the query log; the log thread records every nth query, where n is this value. Default: 0x0000000a (10). 

  • WorkerThreads 

    The number of worker threads that can simultaneously use the processors available on the Analysis server. Default: 2 per processor. 

Typically, the value of the PoolProcessThreads registry setting should be set to 125 percent of the value of the PoolWorkerThreads registry setting, because each worker thread can allocate more than one process thread to complete a request. If the Analysis server consistently handles numerous and complex requests, raising the value of the PoolProcessThreads registry setting to 150 percent or more of the PoolWorkerThreads value will reduce queuing and improve perceived performance. If the Analysis server consistently returns errors during peak usage periods, increasing the value of the PoolWorkerThreads and PoolProcessThreads registry settings will provide more available worker threads and reduce the rate of errors related to non-available worker threads.
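
As a sketch of the 125 percent guideline, the following VBScript reads the worker thread pool size and writes a proportional process thread pool size. It assumes registry permissions on the Analysis server, and that the MSSQLServerOLAPService service is restarted afterward so the new value is read:

' Illustrative only.
Dim sh, base, workers
Set sh = CreateObject("WScript.Shell")
base = "HKLM\Software\Microsoft\OLAP Server\CurrentVersion\"
workers = sh.RegRead(base & "PoolWorkerThreads")
' Size the process thread pool at 125 percent of the worker thread pool.
sh.RegWrite base & "PoolProcessThreads", CLng(workers * 1.25), "REG_DWORD"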

Only one setting can be directly changed through Analysis Manager, using the Properties dialog box. For more information about how to access the Properties dialog box, see SQL Server Books Online.

The following list shows the tab and option in the Properties dialog box, and the corresponding registry value, for the setting that can be changed.

  • Environment tab, Maximum number of threads: WorkerThreads 

Client Thread Management

Client thread management is much less complex than server thread management. PivotTable Service uses only two threads to interact with an Analysis server:

  • Ping thread 

    The ping thread is a background thread used to support the synchronization process between the server data cache and the client data cache. After a set number of milliseconds, determined by the Auto Synch Period property (described in "Client Settings" below), the version stamp for the data and meta data stored in the client cache is compared against the version stamp for the data and meta data stored in the server data cache. If the versions differ, the information in the client data cache is synchronized with the information in the server data cache. 

  • Query thread 

    The query thread is a foreground thread used to support query requests between the client application and the Analysis server. When a query request is made, the query thread is used to first scan the client cache for data pertinent to the query request. If the data does not reside in the client cache, it then submits a query request to the Analysis server to retrieve and cache the desired data. 

If the ping thread is turned off (by setting the Auto Synch Period property to zero), the query thread can still support caching and synchronization, but the process is no longer automatic; the latency period of stale data in the client cache is dependent on the elapsed time between submitted query requests.

Client Settings

The following properties can be used with PivotTable Service to control the behavior of client thread management.

  • Auto Synch Period 

    The interval, in milliseconds, at which the ping thread synchronizes the client cache with the server cache. Default: 250. 
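
For example, two hypothetical connection strings, one lengthening the synchronization interval and one disabling the ping thread entirely, might look like this; the server and catalog names are assumptions:

' Illustrative only.
Dim connSlowSync, connNoPing
' Synchronize the client cache with the server cache every 5,000 milliseconds.
connSlowSync = "Provider=MSOLAP;Data Source=localhost;" & _
    "Initial Catalog=FoodMart 2000;Auto Synch Period=5000"
' Zero disables the ping thread; the caches are reconciled only when
' query requests are submitted, so stale data can linger between queries.
connNoPing = "Provider=MSOLAP;Data Source=localhost;" & _
    "Initial Catalog=FoodMart 2000;Auto Synch Period=0"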

Processing Interaction

Processing a cube or dimension in Analysis Services requires substantial interaction between the Analysis server and the underlying relational databases. Processing cubes and dimensions involves at least two of three phases. The first phase, or base phase, retrieves data from underlying data sources. The second phase, or indexing phase, constructs indexes for the retrieved data to speed data transfer. The last phase, or aggregating phase, scans the indexed data for aggregation, creating aggregations in memory and writing them to storage.

Whenever a dimension is processed, the following steps are performed:

  1. Analysis Services queries the underlying relational database for unique members. The volume of data to be processed depends on the processing option selected and the state of the existing dimension. 

  2. The data is then resorted in memory for multidimensional access. 

  3. Dimension indexes, used to optimize member access from the dimension, are then created. 

  4. Both members and dimension indexes are then stored in memory (for shared dimensions) and in disk storage. 

Whenever a partition is processed, the following steps are performed. (Every cube contains at least one partition, so processing a cube means processing one or more of its partitions.)

  1. Each dimension, shared or private, that is in an inconsistent state is first processed, following the previous set of steps for dimension processing. 

  2. Analysis Services reads the fact table for the partition, sorts the data in memory for multidimensional access, and stores the data on disk in segments of 64K rows each. 

  3. The data segments are read back and indexes are generated. In the same pass, aggregations are created in memory on a per-partition basis; if process buffer memory is insufficient to hold all aggregations, a temporary file is used to swap aggregations into and out of memory. 

  4. Indexes and aggregations are then stored on disk. 

Processing does not always have to complete before users can access processed cube data. With the use of lazy optimization in Analysis Services, the indexing and aggregating phases can be made to return cube data as it is aggregated. More information about this form of process optimization, and its ramifications, appears later in this chapter.

Two types of process threads are allocated during processing. Both types work in parallel, so Analysis Services can use multiple-processor server configurations more effectively:

  • Reader threads 

    Reader threads populate the process buffer with records retrieved from the underlying data source. Reader threads are used to read in records a block at a time, with the number of records per block specified in the ProcessRecordsReportGranularity registry setting, until a single segment of data is retrieved. Once retrieved, the segment is written to the partition file. 

    For more information about the ProcessRecordsReportGranularity registry setting, see "Memory Management" in this chapter. 

  • Aggregation threads 

    Once a segment is available in the partition file, multiple aggregation threads aggregate the data and cache the aggregations in memory. If the processing operation can no longer afford memory, the aggregations are then written to alternating temporary files. As one segment is processed, the aggregations for that segment are written to the first temporary file. When the second segment is processed, the aggregations in memory for the second segment are merged with the aggregations in the first temporary file, and are written to the second temporary file. When a third segment is processed, the aggregations in memory from the third segment are merged with the aggregations in the second temporary file, and are written to the first temporary file. This process repeats until the last segment is aggregated, at which point all of the aggregations stored in the last-used temporary file are compressed and stored for use. 

Because of the amount of data handled during processing, several factors can affect processing performance:

  • Hardware configuration 

  • Cube and dimension design 

  • Storage mode 

  • Aggregation design 

  • Schema optimization 

  • Partition strategy 

  • Processing option 

  • Memory management 

Querying Interaction

When a query is issued by a client application to Analysis Services, a great deal of activity happens in the background. The following steps are used to resolve a query from a client application:

  1. PivotTable Service parses the query and, if the query references multiple data slices, the query is then broken apart into individual query requests.

  2. The query processor parses the query requests and then retrieves dimension information from dimension memory to determine the levels and members needed to resolve the query requests.

  3. PivotTable Service then checks the client cache to determine if query requests can be answered by cached results. Individual query requests that cannot be fulfilled from the client cache are sent to the Analysis server. 

  4. The query processor then retrieves as much data as possible from the query result cache for each query request that can be answered from the cached data. 

  5. For the query requests that cannot be answered from the query result cache, the query processor then retrieves the data for each query request from the appropriate partition. If applicable, the retrieved cube data is also placed in the query result cache. 

  6. PivotTable Service receives the returned data for all of the individual query requests and post-processes the data, if needed, to fulfill the original query. 

Because of this multi-step process, several factors can affect querying performance:

  • Hardware configuration 

  • Cube and dimension design 

  • Storage mode 

  • Memory management, including the relevance of query result cache and dimension cache content 

  • Aggregation design, including the involvement of usage analysis 

  • Partition strategy 

Improving Overall Performance


The information in this section outlines the techniques common to improving both processing and querying performance. Several factors affect overall performance:

  • Hardware configuration 

  • Dimension and cube design 

  • Storage mode selection 

  • Aggregation design 

  • Schema optimization 

  • Partition strategy 

Hardware Configuration

When you choose a server configuration to support Analysis Services, you should follow several general guidelines. Three main factors—processors, memory, and disk storage—affect the overall performance of an Analysis server. Each is discussed separately, in detail, and recommendations are provided for improving overall performance.

Processors

Scaling up to multiprocessor servers allows for much greater Analysis server performance. Scaling up, however, does not necessarily provide a linear increase in performance. Other factors, such as physical memory or disk storage, can affect the increase provided by scaling up an Analysis server. Using a symmetric multiprocessor (SMP) server, however, can significantly increase the performance of an Analysis Services solution at an acceptable relative cost.

An alternate way to distribute load through Analysis Services is to use distributed partitioned cubes and linked cubes. Linked cubes allow you, in essence, to alias cubes from other Analysis servers, while distributed partitioned cubes allow you to distribute partitions for a single locally administered cube across multiple Analysis servers. Both options provide a limited form of load distribution, but they significantly increase administration complexity and should not be used as a substitute for scaling up the Analysis server. Other scale-out solutions can also be considered, such as creating specialized data marts, each with its own Analysis server.

Memory

There are two items to consider regarding memory: the virtual address space of the Analysis Services process and the physical memory that backs that virtual space.

Microsoft Windows® 2000 and Windows NT generally limit a process to a 2GB virtual address space. This space can be extended to 3GB using the /3GB boot setting. This is recommended if Analysis Services is the primary use of the server machine and if large dimensions are involved. Within the address space, the service allocates memory for all uses: dimension storage, process buffers, and the query result cache, as described above. However, if any dimension exceeds the VLDM Threshold, it will be migrated to a separate process and will occupy a different virtual address space.

Note When the /3GB switch is used, the upper bound on memory must be set using the HighMemoryLimit registry setting, because the Memory conservation threshold property cannot be set greater than 2GB when using Analysis Manager.

System performance will be improved if the entire virtual address space can be backed by physical memory. This enables the system to avoid paging, the great time waster. Because the Analysis Services processes cannot use more than 3GB of address space, however, the only time a system would benefit from more than about 4GB of memory (3GB for Analysis Services + 1GB for the operating system and file system cache) is:

  • When VLDM is involved. 

  • To support other applications. 

Disk Storage

Disk storage planning is important for Analysis Services solutions, because even mid-range implementations of Analysis Services can involve large amounts of disk storage and disk access. Combined with the high performance and fault tolerance generally required for decision support solutions, a RAID (Redundant Array of Inexpensive Disks) array is an excellent choice for improving disk storage performance.

One of the most common methods of ensuring reliable and fast disk storage, RAID uses an array of disk storage devices, treating them as one disk storage device in terms of disk access. RAID can be implemented as a hardware solution, in the form of a RAID controller, or as a software solution, such as the Windows 2000 software RAID implementation based on dynamic disks. RAID provides two benefits to overall server performance:

  • Fault tolerance 

    RAID protects against data loss due to disk storage device failure by distributing information across multiple disk storage devices, including information needed to recreate a disk storage device if it should fail. 

  • Performance increase 

    Most RAID controllers divide data I/O operations evenly across all disk storage devices in the RAID array, effectively providing parallel I/O and increasing perceived performance. 

    For information about RAID, see Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing." 

Recommendations
  • Scale up, and then scale out. 

    Analysis Services is designed to support multiple processors, assigning worker and process threads as needed to distribute processor load within a single server. A single multiprocessor Analysis server can support all but the most demanding enterprise-level implementations. Scaling out can be more difficult and adds significant administrative complexity; for Analysis Services, scaling up is generally more useful than scaling out. 

  • Allocate physical memory as appropriate. 

    Overall Analysis Services performance can be greatly improved with a small increase in physical memory. This is especially true for processing operations, which can be quite memory-intensive. Avoid setting the HighMemoryLimit registry setting higher than the amount of available physical memory, to enforce the use of physical memory, and reduce the number of other active applications on the Analysis server. Ideally, to prevent crowding of Windows 2000, the HighMemoryLimit registry setting should be set to no more than approximately 90 percent of available physical memory; this is especially true for installations with 64 MB or less total physical memory. 

    Although 64 MB of physical memory is recommended, it is not unreasonable to install gigabytes of memory on a server. This enables the system to avoid paging, use large process buffers to avoid I/O in creating aggregations, and allocate a large query result cache so more queries can be answered directly from cache. 

  • Use hardware or software RAID solutions to achieve the storage volume, disk speed, and redundancy needed. 

Dimension and Cube Design

Designing dimensions and cubes to increase performance can be a highly subjective process; cubes and dimensions should be defined primarily by the requirements of underlying data and business scenarios.

A detailed discussion of dimension and cube design is beyond the scope of this chapter. For more detailed information about dimension and cube design, including a discussion of new features in Analysis Services, see Chapter 22, "Cubes in the Real World."

Storage Mode Selection

The storage mode selected for partitions and dimensions can greatly affect both processing and querying performance.

Multidimensional OLAP (MOLAP)

MOLAP storage stores both the underlying data used to create the cube and the aggregations created for the cube in a multidimensional structure maintained on the Analysis Server. Because Analysis Services can answer all queries directly from this multidimensional structure, this storage mode presents the best overall querying performance of all three storage modes.

Because the underlying data used to create the cube is copied from the relational database to this multidimensional structure, the storage requirements needed to support MOLAP can be the largest of the three storage modes. However, for MOLAP storage Analysis Services compresses the data to about 20 to 30 percent of its original storage requirements in the source database tables (excluding relational database indexes, which are not copied to MOLAP storage). Thus, the storage penalty for MOLAP is not as severe as might be expected and becomes a factor only when very large amounts of data are processed.

Relational OLAP (ROLAP)

On the other end of the spectrum, ROLAP stores the aggregations created for the cube in the same relational database that provides the underlying data for the cube. This involves the construction of aggregation tables in the relational database, as well as large table joins to reference fact and dimension tables stored in the database.

Because Analysis Services must query the relational database to retrieve all data for a partition or dimension that uses ROLAP, this storage mode provides the slowest querying performance of the three storage modes in Analysis Services. ROLAP can also have significantly slower processing performance than the other storage modes because new tables and numerous table joins must be created in order to store aggregations in the relational database. ROLAP also incurs a significantly higher load on the relational database than the other two storage modes because of the potentially immense joins between the fact and dimension tables from which the aggregation tables are generated.

ROLAP typically uses less additional storage than MOLAP, but more than HOLAP; although underlying data is not copied, the aggregations stored in the relational database are not compressed.

ROLAP, however, has some features that make it useful as a storage mode. ROLAP storage allows for very large or huge dimensions, because dimension size is not limited by memory address space. Also, when used with SQL Server 2000 relational databases, ROLAP allows real-time OLAP for Analysis Services, because indexed views can be constructed for ROLAP dimensions and partitions on the relational database and used as underlying data, and the relational database can notify Analysis Services when data changes.

Hybrid OLAP (HOLAP)

HOLAP combines elements from both MOLAP and ROLAP storage modes. As with MOLAP, aggregations are stored on the Analysis server in multidimensional storage. However, the underlying data for a cube is not also copied into multidimensional storage. Similar to ROLAP, underlying data is accessed from the relational database.

Because underlying data does not need to be copied to the Analysis server, HOLAP often provides the best processing performance and uses the least additional storage. Query performance is typically better than ROLAP but not as good as MOLAP, depending on aggregation design. Queries that can be answered with data from aggregations will perform as well in HOLAP as MOLAP because access to relational tables is not required. On the other hand, queries that require access to data stored in the relational database to retrieve individual facts or to create on-demand aggregations can be as slow in HOLAP as in ROLAP.

Recommendations
  • Use MOLAP before HOLAP, HOLAP before ROLAP. 

    You should use MOLAP wherever feasible, unless business or data requirements dictate otherwise. 

    HOLAP is best used where MOLAP would otherwise be recommended but is ruled out by its storage requirements. 

    ROLAP should only be used to support dimensions that are too large for other storage modes, or if real-time OLAP is needed. 

Aggregation Design

Designing aggregation storage involves tradeoffs between processing time, disk storage, and query response time. Processing time and disk storage increase as more aggregations are designed, but more aggregations increase the probability that queries can be answered from precalculated aggregations. Also, due to the nature of multidimensional structures, the disk storage for aggregations can increase exponentially as more aggregations are designed. For more information on how aggregation affects disk storage, see Chapter 22, "Cubes in the Real World."

In general, minimal aggregations should be designed initially and additional aggregations created in response to analysis of typical user queries. For very large and complex cubes that contain numerous dimensions with many levels, initial aggregation design will take a long time, consume large amounts of disk space, and will be less likely to provide a mix of aggregations that can be used by many queries. A more effective approach in this case is to design zero aggregations and then iteratively use the Usage-Based Optimization Wizard to create aggregations based on several sessions of typical user activity.

Partition strategy also complements overall aggregation design. For information on how partition strategy affects aggregation design, see "Partition Strategy" later in this chapter.

Recommendations

A balance between disk storage and processing time can be achieved with a few storage design guidelines:

  • Start at 25 percent performance gain, and then optimize aggregations based on usage. 

    The Storage Design Wizard can be used to initially design aggregations based on a mathematical model of uniform usage. Use the Storage Design Wizard to design aggregations based on an anticipated 25 percent performance gain. Allow users to access the cube for a specified period of time, and then use the Usage-Based Optimization Wizard to perform usage-based analysis and redesign aggregations based on real world data. For very large and complex cubes that contain numerous dimensions with many levels, start with zero aggregations. 

Schema Optimization

Analysis Services uses solid, dependable techniques to ensure that the cubes and dimensions you design are stable and reliable. However, this also means that Analysis Services automatically takes the safest possible path when designing the queries used to populate cube and dimension data from underlying data sources. After the design of a cube has become stable, you can use schema optimization to reduce the number of joins required when processing.

For more information about the effects of schema optimization, including examples, see Chapter 22, "Cubes in the Real World."

Partition Strategy

The organization of files used to store cube data, referred to as partitions, can make a great deal of difference in the performance and maintainability of cubes in Analysis Services. Cubes can use either a single partition, in which all fact table and aggregation data are stored in a single file, or multiple partitions, in which fact table and aggregation data can be logically and physically divided into multiple files.

Within a cube, different partitions can have different data sources, source tables, aggregation designs, and even storage modes. This allows a great deal of flexibility when managing large cubes. Using multiple partitions offers the following performance and maintenance benefits:

  • Historical data management 

    The technique for managing historical data using partitions in Analysis Services is no different, in theory, than the technique used for managing historical data using partitioned tables in data warehousing. The rolling window technique can be used to keep a set number of partitions, storing data slices based on a time dimension, to maintain a set number of time periods. After a time period expires, the old partition can be removed and a new partition can be added to maintain the rolling window. 

  • Querying performance 

    Queries executed against a cube are separated into individual query requests, each representing a data slice. If the data for the cube is divided into multiple partitions, less I/O is potentially required to retrieve data from the cube. 

    For example, a financial cube contains 13 partitions, each representing a monthly data slice derived from a time dimension. If a user issues a query requesting all of the sales figures for a particular department for a single month, the retrieval process does not need to scan all 13 partitions; because the data slice resides in a single month, only a single partition needs to be scanned. If the cube stored all 13 months of data in a single partition, the entire partition would need to be scanned. 

    Aggregation designs can be different for different partitions. If you have a cube with both current and historical data, and the current data represents 80 percent of total query interaction, you can use a much more aggressive aggregation design on the partitions that contain current data as compared to the partitions that contain historical data. This concept also applies to storage modes—for example, you can use MOLAP for the partitions that contain current data, and HOLAP or even ROLAP for the partitions that contain historical data, conserving disk storage on the Analysis server. 

    Partition support, for querying purposes, is transparent. The user is not aware of the underlying partition model while querying the cube. 

  • Processing performance 

    Partitions can be processed individually, allowing for better selective processing and reducing the total amount of time needed to process a cube. For example, a financial cube with 13 partitions, each representing a month of financial data, does not need full processing for all 13 partitions—only one partition needs to be fully processed every month. 

    Also, the ability to tailor aggregation design and storage modes for each partition allows you to manage disk storage and server usage more effectively. Data that is infrequently accessed can be relegated to HOLAP or ROLAP storage with few or no stored aggregations, minimizing disk storage. Frequently accessed data, by comparison, can use MOLAP or HOLAP storage and contain a large number of stored aggregations, increasing performance and decreasing server usage by reducing the number of aggregations computed at query time. 

The techniques used to construct and maintain cube partitions are similar to the techniques used to construct and maintain partition tables in a data warehouse.
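
For example, the monthly maintenance described above might fully process only the newest partition through the DSO library. The following VBScript sketch is illustrative only; the server, database, cube, and partition names are assumptions, and the ProcessTypes constant is written out because VBScript cannot reference the DSO type library directly:

' Illustrative only; the partition name is hypothetical.
Const processFull = 1   ' DSO ProcessTypes value for a complete rebuild
Dim dsoServer, dsoCube, dsoPartition
Set dsoServer = CreateObject("DSO.Server")
dsoServer.Connect "localhost"
Set dsoCube = dsoServer.MDStores("FoodMart 2000").MDStores("Sales")
Set dsoPartition = dsoCube.MDStores("Sales December 1998")
' Fully process only the newest partition; the other partitions are untouched.
dsoPartition.Process processFull
dsoServer.CloseServer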

For more information about partition strategy in a data warehouse environment, see Chapter 18, "Using Partitions in a SQL Server 2000 Data Warehouse."

Recommendations
  • Slice your data based on business requirements. 

    Maximum querying performance is obtained by constructing partitions with data slices that mirror the data slices that are required for business support. For example, a financial cube typically tracks data as a time series and most queries will retrieve data based on time period, so partitioning the cube by time period provides the most performance benefit. 

    The business requirements for the data represented by the cube will generally dictate the structure and data slice for each partition. 

Improving Processing Performance


Processing performance is almost entirely dependent on memory availability, although it is also affected by CPU speed and disk storage speed. Several considerations affect processing performance beyond those that affect overall performance.

Processing Options

You do not have to fully process cubes and dimensions in Analysis Services every time underlying data changes. Several different options are available to you when processing, and using the right processing option can mean measuring the time needed to process a cube in minutes, rather than hours or days.

Most shared dimensions seldom require full processing. Full processing involves the complete reconstruction of a dimension, and should occur only if the hierarchy of a dimension changes in some way, such as the addition or deletion of a level. Adding new members or changing existing members does not alter the hierarchy. Private dimensions require full processing more often, because their structure is inextricably linked to the structure of their cube.

Correspondingly, most cubes do not require full processing once operational. As with dimensions, cubes should only be fully processed if the structure of the cube changes. Adding or removing dimensions or measures, changing partitions or aggregations, and so on require full processing of the cube. Additions to underlying fact table data, by contrast, only require an incremental update of cube data.

Recommendations
  • Use selective processing. 

    Use the incremental update option when processing dimensions and cubes, unless structural changes have occurred. Use full processing only when absolutely necessary. (A scripted sketch of selective processing follows these recommendations.) 

  • Incrementally update dimensions when updating cubes. 

    Take advantage of this option when processing cubes—you can streamline the maintenance of most Analysis Services solutions in this manner. 

  • Take advantage of lazy processing where appropriate. 

    Lazy processing essentially turns the indexing and aggregating processing phases into background operations, allowing users to continue accessing the cube while these phases are completed. This option is not appropriate in all circumstances, because queries will run slower until the background processing is completed, but it can increase cube availability for users. 

  • Ensure enough disk storage for shadow copies and temporary files. 

    During the indexing and aggregating phases, shadow copies and temporary files are often needed to maintain aggregation data. Processing can fail if there is not enough disk storage for these temporary files and for the aggregation data itself.

Memory Requirements

The memory requirements for processing are different from those for querying. Processing cubes and dimensions, especially large ones, can easily exhaust physical and virtual memory if no restrictions are placed on memory allocation.

Process buffer memory, as discussed earlier, is allocated up to a finite limit; no matter how many processing operations are running, the process buffer allocation cannot exceed the limit established by the ProcessReadSegmentSize registry setting. The set of economic model algorithms discussed earlier manages the process buffer, ensuring that processing operations that require priority automatically receive an optimized amount of process buffer memory.

Process buffer size is central not just to the raw capacity available during processing, but also to the overall performance of the partition once processed. During the indexing phase, the process buffer is used to sort fact table data and to construct indexes based on the results of the sort, so partition access may be adversely affected if the process buffer cannot sort the fact table data efficiently.

Recommendations
  • Set the ProcessReadSegmentSize registry setting based on available physical memory. 

    This value should be set according to the needs of your Analysis server. A good rule of thumb is to set this value to between 70 and 80 percent of the total physical memory available to the server. Avoid the temptation to set it higher than 80 percent, because this may force memory to be paged to disk. Setting it to 100 percent or higher will definitely cause forced memory paging, and will most likely cause your Analysis server to thrash, or spend more time in swapping memory to and from disk than in executing applications. 
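
    As a worked example with hypothetical figures: on an Analysis server with 1 GB (1,024 MB) of physical memory, the rule of thumb yields a ProcessReadSegmentSize setting between roughly 717 MB (70 percent of 1,024 MB) and 819 MB (80 percent of 1,024 MB), leaving at least 200 MB of physical memory for the operating system and other processes.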

  • Increase the memory available to Analysis Services. 

    On servers running Windows 2000 Advanced Server or Windows 2000 Datacenter Server, add the /3GB switch to the boot.ini file so that the MSSQLServerOLAPService service can address up to 3 GB of virtual memory, and adjust the Analysis Services HighMemoryLimit registry setting to take advantage of the additional memory. 

Storage Requirements

Disk storage issues can adversely affect processing performance in Analysis Services. Disk storage is used not only to store data and meta data for Analysis Services objects, but also for temporary files used during processing.

Recommendations
  • Remember the effects of data explosion. 

    Data explosion can dramatically increase the amount of disk storage needed for storing aggregations. Compression helps conserve disk storage, but minimizing aggregations to save disk storage penalizes users by increasing query response time.

  • Plan disk storage space for shadow copies and temporary files. 

    Ensure that enough free storage space is available to completely copy the largest single partition stored on an Analysis server, and add enough available storage to maintain two complete copies of aggregation data for the largest single partition stored on the server. 
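
    As a worked example with hypothetical figures: if the largest partition on the server occupies 10 GB, of which 6 GB is aggregation data, plan for roughly 10 + (2 x 6) = 22 GB of free disk storage so that processing can hold the shadow copy and both copies of the aggregation data simultaneously.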

  • Err on the side of caution. 

    Provide as much storage as possible to permit growth; new cubes or additional partitions for existing cubes are common as an Analysis Services solution matures. Having too much disk storage is far preferable to having too little. 

Improving Querying Performance


While improving processing performance directly benefits administrators, it is only half of achieving optimal performance. Good querying performance on the Analysis server is also crucial to the success of any Analysis Services solution, and it directly benefits clients.

Memory Requirements

Query performance depends heavily on whether data relevant to the query already resides in the query result cache maintained by the Analysis server, or in the client cache maintained by PivotTable Service.

Ideally, the best possible performance for a specific query is obtained by having all data needed to fulfill the query cached on the client. No round trip to the server is needed; PivotTable Service is fully capable of analyzing a query against the client cache to determine if server data is needed.

Realistically, memory and usage limitations typically prevent this from happening. Instead, a combination of client cache retrieval, server cache retrieval, and disk storage retrieval is usually needed to fulfill a stream of queries, in differing proportions based on the state of both the client cache and the server cache.

The most striking effect on query speed is the difference between a cold cache and a warm cache. A cold cache occurs when the Analysis server first starts; because no queries have yet been processed, no data is cached. After enough queries have executed to build up a common set of cached data, a warm cache can provide significantly faster performance, because no storage interaction is needed for cached data.

One of the easiest ways to reduce the effects of a cold cache is to create a simple application, executed every time the MSSQLServerOLAPService service starts, that runs the most commonly used MDX queries against the Analysis server. This, in essence, preheats the server cache with the most commonly used data. To identify those queries, use the Properties dialog box in Analysis Manager to adjust the query logging settings of the Analysis server; suitable MDX queries can then be constructed from the query log information.
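
For example, a warm-up application might connect to the Analysis server at service startup and execute a handful of the logged queries. The following is a minimal sketch; the cube, dimension, and measure names are hypothetical placeholders rather than objects defined in this chapter:

SELECT
{[Measures].[Sales Amount], [Measures].[Unit Count]}
ON COLUMNS,
{[Time].[1998].CHILDREN}
ON ROWS
FROM
SalesCube

Executing a set of such queries forces the Analysis server to read the corresponding data slices from disk storage into the server cache, so that subsequent user queries against the same slices can be answered from memory.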

Preheating the cache can also be useful on the client in certain situations. By executing the same commonly used MDX queries from the client application as well, the client cache is populated with commonly used cube data. Performed unobtrusively during the startup sequence of a typical custom client application, this technique can significantly increase the perceived performance of the client application without extensive coding.

Maximum memory allocation for the server query result cache can be determined by subtracting the memory allocated for all other buffers from the HighMemoryLimit registry value. This information can then be used to determine if allocation levels for the other buffers can be altered to balance querying and processing performance.
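
As a hypothetical illustration of this arithmetic: if the HighMemoryLimit registry value corresponds to 800 MB, and the process buffer, read-ahead buffer, and other allocations together account for 300 MB, then roughly 800 - 300 = 500 MB remains as the maximum allocation for the server query result cache.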

For more information about the various registry settings used to improve querying performance, see "Memory Management" and "Thread Management," in this chapter.

Usage Analysis and Aggregation Design

The Storage Design Wizard provides basic aggregation design, based on a mathematical model that assumes any one query request is as likely as any other. This uniform model allows for quick construction of aggregations, but it does not always produce the optimum aggregation design.

Realistically, the only way to accurately optimize the aggregation design for a cube is to allow users to access the cube, gather historical data on query requests, and then use that historical data to model the actual usage patterns for the cube. Once the usage patterns are determined, aggregations can be constructed to more exactly match business needs.

The Usage-Based Optimization Wizard automates the process of usage pattern analysis. Usage-based optimization can be performed for a variety of query patterns: queries can be analyzed by date, frequency, user, or data storage access, or by any combination of these criteria, to determine an optimal aggregation design for a specific set of query requests.

The criteria to use are likewise determined by business requirements. For frequently used cubes available to a wide audience, usage patterns based on frequency, that is, on the most commonly executed query requests, are most often analyzed. For infrequently used cubes, usage patterns based on time may be more applicable. Cubes designed for a specific group of users may benefit as much from a user-based usage pattern analysis as from a frequency-based one.

Ultimately, the best way to determine the optimal usage pattern analysis is through experimentation. Initially, design aggregations with the Storage Design Wizard, allowing for a 25 percent performance gain, and then expose the cube to your audience. (For very large and complex cubes, start with zero aggregations.) Collect usage information, and then analyze it based on your understanding of the business behavior of your audience.

Recommendations
  • Allow sufficient time to record usage pattern information. 

    As with any other pattern analysis, more data typically produces better analysis. Attempting to model usage patterns from a few days of usage data will probably not produce an optimum solution. For example, cubes used by financial applications will typically see uniform usage patterns during the middle of a financial period, slightly wider usage patterns at the beginning of a financial period, and much wider usage patterns near the end of a financial period. Accumulating at least one month of usage data for such cubes provides better insight into the actual usage patterns of the cube, producing a better aggregation design. 

    Allow your business requirements to dictate the time needed to record usage before usage-based optimization is performed. 

  • Remember that usage-based optimization is performed on a per-partition basis. 

    Not all partitions need to have the same aggregation design. When you perform usage-based optimization, the recommended changes are executed only for a single partition, not on all of the partitions across the cube. For cubes with multiple partitions, the Usage-Based Optimization Wizard should be executed for each partition to model an effective aggregation design for the entire cube. 

Evaluating Performance


While discussing performance improvements and providing recommendations are both useful, implementing those recommendations without the tools necessary to evaluate their relative success is, at best, a process of educated guessing.

The Windows 2000 System Monitor, part of the Windows 2000 Performance Tool, is an excellent utility for examining the performance of an Analysis server, either by obtaining a snapshot of current performance or by reviewing historical performance. To this end, Analysis Services provides several performance objects, each representing several performance counters, to track the minutiae of Analysis server activities. Combined with the performance counters provided to track system performance, the Windows 2000 System Monitor can be used to assist in fine-tuning overall Analysis Services performance on a server-by-server basis.

Analysis Services Performance Counters

Analysis Services provides over 140 performance counters, measuring almost every possible metric within the Analysis server. Each performance counter belongs to one of three groups:

  • Current 

    All of the counters in this group provide a current view of performance information. The information provided by these counters is subject to system conditions, such as server load or network latency, and should be viewed as a snapshot of current system behavior. 

  • Total 

    All of the counters in this group provide a total of performance information, accumulated from the start of the MSSQLServerOLAPService service. 

  • Ratio 

    All of the counters in this group provide a ratio of performance information, typically over time. For many ratio counters, this information is constructed by averaging the values of total counters over the time elapsed from the start of the MSSQLServerOLAPService service. 

Although all of the performance counters are useful, many are difficult to directly relate to system performance. This section focuses on performance counters that can be directly related to system performance, including performance counters that demonstrate the impact various registry settings can have on the Analysis server.

Each performance counter in Analysis Services belongs to a specific performance object; to access a specific performance counter, you must first select its performance object in the Windows 2000 System Monitor.

For more information on how to use the Windows 2000 System Monitor, see "System Monitor" in Windows 2000 documentation, or "Windows 2000 System Monitor" in the Platform SDK section of the MSDN® Online Library.

Analysis Server:Agg Cache

The performance counters contained in this performance object monitor the server query result cache, examining such metrics as size and throughput. The following counters are best used to examine cache performance:

  • Current bytes 

    The current size, in bytes, of the server query result cache. Monitored over a period of time, this value can indicate whether you have additional memory available for expanding other buffers, such as the process buffer or the read-ahead buffer, to increase processing performance. 

  • Direct hits/sec 

    The number of data slices per second satisfied entirely by the server cache. The higher this number, the better the perceived performance. 

  • Filter hits/sec 

    The number of data slices per second partially satisfied by the server cache. This number should be much higher than the Misses/sec performance counter, but lower than the Direct hits/sec performance counter. 

  • Lookups/sec 

    The total number of query requests per second reviewed by the server cache. This number should equal the sum of the Direct hits/sec, Filter hits/sec, and Misses/sec performance counters. 

  • Misses/sec 

    The number of data slices per second that the server cache cannot satisfy, requiring retrieval from disk storage. This number should be as low as possible. 

The last four performance counters describe how effective the cache is in responding to query requests. A high number of direct hits and lookups per second, together with a low number of filter hits and misses per second, indicates that the cache is being well utilized. The concept of a preheated cache works well here, increasing the Direct hits/sec and Filter hits/sec values and reducing the Misses/sec value by making more cached data immediately available at server startup.
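
For example, a server cache sustaining 100 lookups per second composed of 70 direct hits, 25 filter hits, and 5 misses per second (70 + 25 + 5 = 100) is being used effectively; if misses climbed to 40 per second, many more disk retrievals would be required, and preheating the cache or allocating additional cache memory might be warranted. These figures are purely illustrative.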

Analysis Server:Connection

The Connection object essentially measures the performance of the listener and worker threads. The following counters are best used to examine connection performance:

  • Current connections 

    Each connection has a worker thread assigned to it, so this performance counter is an excellent metric for observing current worker thread allocations. 

  • Failures/sec 

    This performance counter represents the number of failed connection requests, for whatever reason, made to the listener thread. One of the reasons a connection request can fail is the lack of available worker threads, so this performance counter can be an indication, if consistently high, of a too-small worker thread pool. 

  • Requests/sec 

    A basic performance counter, this indicates the total number of connection requests made to the Analysis server per second. This counter is generally useful as a basic measure of work performed by the Analysis server. 

Analysis Server:Last Query

The Last Query performance object is best used to observe bottlenecks in the steps used to answer a single query.

  • Data bytes 

    The number of bytes read from the data file, used to store aggregation data for a cube. If this number is inordinately high when the query request is executed against a warm cache, this can indicate less than optimal cache management. 

  • Data reads 

    The number of read operations needed to read data from the data file. As with the Data bytes performance counter, inordinately high values for this performance counter can indicate less than optimal cache management. 

  • DSN requested 

    The data set name of the request, represented as the ordinal number of the level needed from each dimension in the cube to satisfy the query request. This information is also used during the usage analysis process in aggregation design. 

  • DSN used 

    The data set name, represented in the same way, of the data actually used to satisfy the query request. This information is also used during the usage analysis process in aggregation design. 

  • Index bytes 

    The number of bytes read from the index file, used to locate members within dimensions. 

  • Index reads 

    The number of read operations needed to retrieve index information. 

  • Map bytes 

    The number of bytes read from the map file, used to locate cells within the cube. 

  • Map reads 

    The number of read operations needed to read data from the map file. 

  • Rows created 

    The number of rows created in memory to answer a query request. Combined with the Rows read performance counter, this can be used to measure the effectiveness of aggregation design for a given cube. 

  • Rows read 

    The number of rows read from the data file to answer a query request. Combined with the Rows created performance counter, this can also be used to measure aggregation design effectiveness. 

  • Time (ms) 

    A useful metric from any viewpoint, this performance counter measures the total elapsed time, in milliseconds, to answer a query request. For this performance counter, lower is definitely better. 

Analysis Server:Locks

The Locks performance object tracks the number and performance of read and write locks managed by the Analysis server. Although this can be useful, for example, in determining deadlocking behavior, this performance object is generally not needed for reviewing overall Analysis server performance.

Analysis Server:Proc

The Proc performance object monitors the base phase of processing, providing detailed information for reviewing processing performance.

  • File bytes written/sec 

    This counter represents the number of bytes written, per second, to multidimensional storage.

  • File rows written/sec 

    This counter, similar to the File bytes written/sec performance counter, represents the number of rows written, per second, to multidimensional storage. Used together, the two counters can assist in locating bottlenecks caused by disk storage on the Analysis server. 

  • Memory size bytes 

    This counter represents the size of memory, in bytes, needed to contain the rows currently being resorted in memory during the base phase.

  • Memory size rows 

    This counter represents the estimated number of rows in memory currently being resorted during the base phase. Combined with the Memory size bytes performance counter, this counter can assist in determining process buffer utilization during the base phase. 

  • Rows created/sec 

    This counter represents the rate at which rows are created in multidimensional storage during the base phase. Although this counter should be close to the Rows read/sec counter, it may be lower because of duplication in incoming rows. 

  • Rows merged/sec 

    This performance counter shows the rate of rows merged from the underlying data source into multidimensional storage. This figure should be as close as possible to the Rows read/sec performance counter for optimal performance. 

  • Rows read/sec 

    This performance counter shows the rate of rows read from the underlying data source by the reader thread during the base phase. Combined with the Rows created/sec and Rows merged/sec counters, this can indicate if bottlenecks occur at the relational database or at the Analysis server during processing. 

Analysis Server:Proc Aggs

The Proc Aggs performance object monitors the aggregating phase of processing, providing detailed information for reviewing processing performance.

  • Memory size bytes 

    This counter represents the size of memory, in bytes, used to contain aggregations being created in memory. 

  • Memory size rows 

    This counter represents the estimated number of aggregations in memory currently being created. Combined with the Memory size bytes performance counter, this counter can assist in determining process buffer utilization during the aggregating phase. 

  • Rows created/sec 

    This counter represents the rate of aggregated rows created in multidimensional storage. Although this counter should be close to the Rows read/sec counter, it may be lower because of duplication in incoming rows. 

  • Rows merged/sec 

    This performance counter shows the rate of aggregation rows merged from the fact table into multidimensional storage. This figure should be as close as possible to the Rows read/sec performance counter for optimal performance. 

  • Temp file bytes written/sec 

    This counter represents the number of bytes written, per second, to temporary files used to hold aggregations during the aggregating phase. Ideally, this counter should have a value of zero—all aggregations should take place in memory. Processing performance significantly degrades if temporary files are required, so this performance counter is an excellent metric for determining if additional process buffer memory is needed. 

Analysis Server:Proc Indexes

The Proc Indexes performance object monitors the indexing phase of processing, also providing detailed information for reviewing processing performance. The indexing phase is not as memory-intensive as the base and aggregating phases, however.

  • Rows/sec 

    This performance counter shows the rate of rows read from multidimensional storage by the reader thread during the indexing phase. A low rate can indicate a problem with disk storage on the Analysis server. 

Analysis Server:Query

The Query performance object represents overall Analysis server querying performance. The following counters can assist in diagnosing querying issues, with a focus on thread management.

Remember, however, that a query request on the Analysis server does not necessarily correspond on a one-to-one basis with MDX queries issued by client applications. Each query request represents a request to retrieve data for a single data slice, and an MDX query can represent many data slices.

  • Avg time/query 

    A basic performance counter, this indicates the average elapsed time, in milliseconds, per query request. Used in conjunction with the Time (ms) performance counter in the Last Query performance object, it can quantify relative query performance for very complex queries. 

  • Current process thread pool 

    This counter represents the current number of process threads, both active and idle, in the process thread pool. 

  • Current process thread queue length 

    This counter represents the current number of queued process requests. Ideally, this counter should be zero—there should always be enough process threads in the process thread pool to handle process requests. Realistically, this number should be as low as possible; consistently high numbers in this counter indicate that the number of threads allocated for the process thread pool is too low. 

  • Current process threads active 

    This counter represents the number of active threads in the process thread pool. Subtract the value of this counter from the value of the Current process thread pool counter to determine the number of idle process threads; if this number is consistently high, too many process threads may be allocated. 

  • Current threads 

    This performance counter represents the total number of both worker and process threads actively working on query requests. 

  • Current worker thread pool 

    This counter represents the current number of worker threads, both active and idle, in the worker thread pool. 

  • Current worker threads active 

    This counter shows the number of active threads in the worker thread pool. Ideally, this number should never equal the number of threads allocated to the worker thread pool; if all worker threads are active, none are available to answer new query requests, causing the listener thread to return errors to client applications. Realistically, this number should stay between 50 and 75 percent of the total number of worker threads allocated for the worker thread pool, evaluated over time. If this number is consistently over 75 percent, or if it hits 100 percent, you may need to allocate more worker threads. If this number is consistently less than 50 percent, you may be able to reduce the number of allocated worker threads. The worked example below illustrates these thresholds. 
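
    As a worked example with a hypothetical configuration: if 40 worker threads are allocated, a Current worker threads active value that stays between 20 and 30 (50 to 75 percent) over time suggests the pool is sized appropriately; values sustained near 40 suggest allocating more worker threads, while values consistently below 20 suggest the pool can safely be reduced.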

Analysis Server:Query Dims

The Query Dims performance object tracks the particulars of dimension data retrieval. Typically, the performance counters maintained by this performance object are not as useful for tracking overall performance, but a few of the counters can provide clues to dimension cache behavior and possible querying and processing performance bottlenecks:

  • Requests/sec 

    This performance counter tracks the number of requests made for members or member properties per second. This performance counter is directly related to the HugeLevelThreshold registry setting; if the setting is set too low, this counter will be correspondingly high. Increasing the value of the HugeLevelThreshold registry setting will decrease the value of this performance counter, at a corresponding cost in dimension cache memory. 

  • VLDM requests/sec 

    This performance counter tracks the number of requests made for members or member properties, per second, to the Very Large Dimension Manager (VLDM). Since very large and huge dimensions are not cached in the same way that other dimensions are cached, using very large and huge dimensions can slow performance. This performance counter can assist in quantifying the effect of very large and huge dimensions on overall performance. 

Analysis Server:Startup

Although all of the performance counters in the Startup performance object are interesting, only one is useful for quantifying server performance:

  • Server uptime 

    This performance counter represents the elapsed time since the MSSQLServerOLAPService service was started, and is used as the time over which total counters are averaged to construct ratio counters. 

System Performance Counters

In order for Analysis Services performance counters to prove useful, they should be compared not only with each other for relative Analysis server performance, but also with overall system performance counters to determine how well Analysis Services is utilizing system resources. The most relevant system performance objects and counters are discussed in this section.

Memory

Although all of the counters in the Memory performance object are useful, two stand out when measuring Analysis Services overall performance:

  • Pages/sec 

    This performance counter indicates the number of I/O operations needed to support virtual memory. Ideally, this number should be as low as possible; a high number indicates too little available physical memory. Increasing available physical memory should reduce the number of page faults, and therefore reduce the amount of virtual memory used to support active processes such as Analysis Services. 

  • Available bytes 

    This performance counter indicates the amount, in bytes, of available physical memory. Combined with the Pages/sec counter, it can help determine whether the server has sufficient physical memory for its current workload.

Network Interface

The Network Interface performance object monitors network traffic, tracking a number of metrics for performance evaluation. One counter, in particular, can be used to isolate possible bottlenecks in processing and querying performance:

  • Bytes Total/sec 

    In order to use this counter, you must select a network interface for monitoring purposes. This counter represents the total number of bytes that are sent and received, per second, for a given network interface.

    The best use of this performance counter is to compare it with the various ratio counters, maintained in the Analysis Server:Proc, Analysis Server:Proc Indexes, and Analysis Server:Proc Aggs performance objects, that track data transfer, such as the Rows read/sec counter in the Analysis Server:Proc performance object. For underlying data sources that are not resident on the Analysis server, if this counter matches the fluctuations of the ratio counters during periods of poor processing performance, the network may be the bottleneck. 

PhysicalDisk

Disk storage performance is central to Analysis Services performance. The following counters can be compared directly with various performance counters maintained by the Analysis Services performance objects to determine whether disk storage is a querying or processing bottleneck.

A common cause of poor disk storage performance is contention from other applications running on the system. Disk storage supports all applications running on a given system, and active applications draw resources away from Analysis Services. These performance counters can provide a better picture of absolute querying and processing performance by comparing all of the disk storage activity on the system with the disk storage activity tracked by Analysis Services.

  • Avg. Disk Bytes/Read 

    This represents the average number of bytes transferred from disk storage during a single read operation. 

  • Current Disk Queue Length 

    This counter represents the current number of queued disk operations. If this number spikes during poor processing performance, especially during the base or aggregating phases, then the current disk storage solution may not be adequate to support the needs of Analysis Services. Ideally, the value of this performance counter should be as low as possible at any given time. 

The following performance counters are especially useful for determining total disk storage system performance, to be compared with the relative performance information tracked by Analysis Services:

  • Disk Bytes/sec 

  • Disk Reads/sec 

  • Disk Transfers/sec 

  • Disk Writes/sec 

For example, the Disk Writes/sec counter can be compared with the File rows written/sec performance counter from the Analysis Server:Proc performance object to determine what share of disk write activity during the base phase of processing is attributable to Analysis Services, as opposed to other applications on the server.

Process

To examine system performance for Analysis Services, you must first select the correct instance. For Analysis Services, the msmdsrv instance represents the MSSQLServerOLAPService service process. Once selected, you can view Analysis Services overall performance from the viewpoint of the operating system.

The following performance counters can be used to generally review Analysis Services performance:

  • % Privileged Time 

  • % Processor Time 

  • % User Time 

  • IO Data Bytes/sec 

  • IO Data Operations/sec 

  • IO Other Bytes/sec 

  • IO Other Operations/sec 

  • IO Read Bytes/sec 

  • IO Read Operations/sec 

  • IO Write Bytes/sec 

  • IO Write Operations/sec 

  • Thread Count 

  • Virtual Bytes 

  • Working Set 

The Thread Count performance counter, for example, can be compared against the various thread performance counters in the Analysis Server:Query performance object to determine thread behavior for additional threads in the Analysis server.

Processor

The performance counters represented by the Processor performance object track performance at the processor level, but typically at too low a level to be directly relevant. However, one performance counter can be used in a general fashion to determine overall performance:

  • % Processor Time 

    This performance counter represents the percentage of time spent processing non-idle threads, either for the entire system or on a processor-by-processor basis. A general indicator of processor usage, this can be compared against other Analysis Services performance counters to determine processor impact. Additionally, a consistently high processor usage may indicate a bottleneck in calculating aggregations. 

System

The System performance object contains general system performance counters. Several of these counters, when compared with Analysis Services performance counters, provide a relative viewpoint of overall performance:

  • Context Switches/sec 

    This counter tracks the rate at which processors switch from one thread to another, and indirectly indicates the efficiency of thread allocation for Analysis Services. Compared against the various thread performance counters in the Analysis Server:Query performance object, it provides a view of overall system thread management. 

  • File Control Operations/sec 

    This counter tracks the number of non-read, non-write I/O operations performed by the disk storage system per second, and directly indicates the efficiency of the disk storage system. Compared against the various counters used to track disk storage I/O in Analysis Services, this counter can be used to reveal bottlenecks involving disk storage. 

  • Processor Queue Length 

    This counter tracks the number of queued threads waiting for processor time. Compared against the various active thread performance counters in Analysis Services, this counter can be used to examine thread management efficiency. 

  • System Calls/sec 

    Another basic performance counter, this counter tracks the number of Windows NT system service routines called per second. This counter is generally useful, when compared against the various performance counters in Analysis Services, to isolate possible operating system bottlenecks. 
