Analysis Services 2005 Processing Architecture
T. K. Anand
SQL Server 2005 Analysis Services
Summary: Processing is the operation in which the Analysis server reads data from the relational data source and populates the cubes, dimensions, mining models, etc. This whitepaper describes the Analysis Services 2005 processing architecture in detail and provides guidance on how and when to use the various processing controls. (22 printed pages)
Processing is a critical and resource intensive operation in the data warehouse lifecycle and needs to be carefully optimized and executed. Analysis Services 2005 offers a high performance and scalable processing architecture with a comprehensive set of controls for database administrators. This includes:
- A rich set of processing options
- Fine-grained parallel processing
- Batch processing
- Lazy processing of indexes and aggregations
- Push-mode processing
The XMLA (XML for Analysis) protocol is natively supported by Analysis Services 2005 and it is the only protocol for sending commands to the server. Analysis Services 2005 introduces several XMLA extensions for server administration. This section discusses the XMLA commands that are relevant to processing.
The Process dialog in SQL Server Management Studio builds an XMLA processing command and sends it to the server. By using the Script option, the XMLA command can be captured.
The Process command tells the Analysis server to process an object. It has the following syntax:
<Process>
   <Object>object_reference</Object>
   <Type>processing_type</Type>
   <Bindings>ool_bindings</Bindings>
   <DataSource>ool_ds</DataSource>
   <DataSourceView>ool_dsv</DataSourceView>
   <ErrorConfiguration>error_config</ErrorConfiguration>
   <WritebackTableCreation>wb_tbl_creation</WritebackTableCreation>
</Process>
The Object element contains an object reference that specifies which object to process. The Type element specifies the type of processing to be performed (full, incremental, and so on). Both of these elements are mandatory; the rest are optional.
The Bindings, DataSource and DataSourceView elements are out of line bindings and will be discussed later in this document.
The ErrorConfiguration element specifies how to handle data integrity errors during processing. It overrides the ErrorConfiguration on the object (if any). A complete discussion of data integrity error handling is outside the scope of this document.
The WritebackTableCreation element applies only to writeback partitions and specifies whether to create a new writeback table or use an existing one. It can take three possible values—Create, CreateAlways and UseExisting.
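As an illustrative sketch (it assumes the target partition has writeback enabled, which is not shown here), a Process command that always recreates the writeback table would add the element after the Type:

```xml
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
   <Object>
      <DatabaseID>Adventure Works DW</DatabaseID>
      <CubeID>Adventure Works DW</CubeID>
      <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
      <PartitionID>Internet_Sales_2004</PartitionID>
   </Object>
   <Type>ProcessFull</Type>
   <!-- Drop and recreate the writeback table (assumes a writeback-enabled partition) -->
   <WritebackTableCreation>CreateAlways</WritebackTableCreation>
</Process>
```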
Following is an example of a Process command that does a full process on a partition.
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
   <Object>
      <DatabaseID>Adventure Works DW</DatabaseID>
      <CubeID>Adventure Works DW</CubeID>
      <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
      <PartitionID>Internet_Sales_2004</PartitionID>
   </Object>
   <Type>ProcessFull</Type>
</Process>
The Batch command is a convenient way to send multiple commands to the server in a single XMLA request. It has the following syntax:
<Batch Transaction="true" ProcessAffectedObjects="false">
   <Bindings>ool_bindings</Bindings>
   <DataSource>ool_ds</DataSource>
   <DataSourceView>ool_dsv</DataSourceView>
   <ErrorConfiguration>error_config</ErrorConfiguration>
   cmd_1
   cmd_2
   cmd_3
   ...
</Batch>
The Transaction attribute is a boolean (true by default) that indicates whether the batch is transactional or not. If Transaction=true, then all commands in the batch are executed in a single transaction, i.e., all or nothing. If Transaction=false, then each command is executed in a separate transaction.
The ProcessAffectedObjects attribute is a boolean (false by default) that tells the server to reprocess other objects impacted by this command. See the Side Effects and Impact Analysis section for more details.
The Bindings, DataSource, DataSourceView and ErrorConfiguration elements are the same as discussed in the previous section. They apply to all commands inside the batch.
A Batch can contain a variety of commands including Create, Delete, Alter, Process, etc. Following is an example of a transactional batch that processes two dimensions and a cube.
<Batch Transaction="true" xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
   <Process>
      <Object>
         <DatabaseID>Adventure Works DW</DatabaseID>
         <DimensionID>Customer</DimensionID>
      </Object>
      <Type>ProcessFull</Type>
   </Process>
   <Process>
      <Object>
         <DatabaseID>Adventure Works DW</DatabaseID>
         <DimensionID>Product</DimensionID>
      </Object>
      <Type>ProcessFull</Type>
   </Process>
   <Process>
      <Object>
         <DatabaseID>Adventure Works DW</DatabaseID>
         <CubeID>Adventure Works DW</CubeID>
      </Object>
      <Type>ProcessFull</Type>
   </Process>
</Batch>
Batch commands are always executed sequentially. However, the Parallel command within a Batch allows you to specify multiple Process commands that are executed in parallel. Parallel is a special command in that it can only appear inside a Batch; it cannot be sent by itself. It cannot contain any command other than Process. It has the following syntax:
<Parallel MaxParallel="0">
   <Process>cmd_1</Process>
   <Process>cmd_2</Process>
   <Process>cmd_3</Process>
   ...
</Parallel>
The MaxParallel attribute specifies the degree of parallelism that the Analysis server can use when executing this Parallel command. It indicates the number of processing jobs that can run concurrently. Zero indicates unlimited parallelism, constrained only by hardware and server workload. The default is zero.
Following is an example of a Batch that processes two dimensions in parallel and then two partitions in parallel.
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
   <Parallel>
      <Process>
         <Object>
            <DatabaseID>Adventure Works DW</DatabaseID>
            <DimensionID>Customer</DimensionID>
         </Object>
         <Type>ProcessFull</Type>
      </Process>
      <Process>
         <Object>
            <DatabaseID>Adventure Works DW</DatabaseID>
            <DimensionID>Product</DimensionID>
         </Object>
         <Type>ProcessFull</Type>
      </Process>
   </Parallel>
   <Parallel>
      <Process>
         <Object>
            <DatabaseID>Adventure Works DW</DatabaseID>
            <CubeID>Adventure Works DW</CubeID>
            <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
            <PartitionID>Internet_Sales_2001</PartitionID>
         </Object>
         <Type>ProcessFull</Type>
      </Process>
      <Process>
         <Object>
            <DatabaseID>Adventure Works DW</DatabaseID>
            <CubeID>Adventure Works DW</CubeID>
            <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
            <PartitionID>Internet_Sales_2002</PartitionID>
         </Object>
         <Type>ProcessFull</Type>
      </Process>
   </Parallel>
</Batch>
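To throttle either of the Parallel blocks above, set the MaxParallel attribute described earlier. For instance, capping the dimension block at two concurrent processing jobs is a one-attribute change (shown here as a sketch using the same object references):

```xml
<Parallel MaxParallel="2">
   <!-- At most two of the contained Process commands run concurrently -->
   <Process>
      <Object>
         <DatabaseID>Adventure Works DW</DatabaseID>
         <DimensionID>Customer</DimensionID>
      </Object>
      <Type>ProcessFull</Type>
   </Process>
   <Process>
      <Object>
         <DatabaseID>Adventure Works DW</DatabaseID>
         <DimensionID>Product</DimensionID>
      </Object>
      <Type>ProcessFull</Type>
   </Process>
</Parallel>
```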
The Type element in the Process command specifies the type of processing to be done. This is also referred to as the "processing option". Choosing the right processing option is very important since it typically affects large volumes of data. This section describes the different processing options for each type of object.
Following are the types of Analysis Services objects that can be processed:
- Database
- Dimension
- Cube
- Measure group
- Partition
- Mining structure
- Mining model
Before we look at the processing options, we first need to understand what exactly processing is doing. Processing generally sends queries to the relational data source and uses the results to populate the Analysis Services object. Following are the storage contents that are built by processing for each object type. A complete discussion of the Analysis server storage format is beyond the scope of this document.
- Database
  - None (a database is just a container of dimensions, cubes, and mining structures; it has no data storage of its own)
- Dimension
  - Attribute stores
  - Hierarchy stores
  - Bitmap indexes
- Cube
  - MDX script cache
- Measure group
  - None (a measure group is just a container of partitions; it has no data storage of its own)
- Partition
  - Fact data
  - Aggregation data
  - Bitmap indexes
- Mining structure
  - Training data
- Mining model
  - Algorithm content
When an object is processed, the processing option specifies what part of the storage content of the object is to be built and how. The following table contains the list of all processing options and which objects they apply to.
| Processing Option | Database | Dimension | Cube | Measure Group | Partition | Mining Structure | Mining Model |
|---|---|---|---|---|---|---|---|
| ProcessFull | X | X | X | X | X | X | X |
| ProcessClear | X | X | X | X | X | X | X |
| ProcessDefault | X | X | X | X | X | X | X |
| ProcessData | | X | X | X | X | | |
| ProcessIndexes | | X | X | X | X | | |
| ProcessUpdate | | X | | | | | |
| ProcessAdd | | X | | | X | | |
| ProcessStructure | | | X | | | X | |
| ProcessScriptCache | | | X | | | | |
| ProcessClearStructureOnly | | | | | | X | |
ProcessFull applies to all objects. It discards the storage contents of the object and rebuilds them. ProcessFull is recursively applied to all descendants of the object as well.
ProcessClear applies to all objects. It discards the storage contents of the object. ProcessClear is recursively applied to all descendants of the object as well.
ProcessDefault applies to all objects. It does the bare minimum required to bring the object to a fully processed state. In other words, it builds only the storage contents that are not currently built. For example, if a partition has fact and aggregation data, then ProcessDefault will only build the bitmap indexes.
The only exception to the above rule is bindings. When you change the bindings of an object (e.g., KeyColumns of a dimension attribute), the object retains its data, but remembers that the bindings have changed. ProcessDefault on the object will discard all the storage contents and rebuild them.
The server handles ProcessDefault by analyzing the object and dynamically converting it to another processing option such as ProcessFull, ProcessIndexes, etc. ProcessDefault is recursively applied to all descendants of the object as well.
ProcessData applies only to the OLAP objects, i.e. dimension, cube, measure group and partition. It discards the storage contents of the object and rebuilds only the "data". For dimensions, it builds only the attribute and hierarchy stores. For partitions, it builds only the fact data.
Essentially ProcessData builds the bare minimum required for the object to be available for queries. Indexes are considered optional and affect only the query performance. ProcessData is recursively applied to all descendants of the object as well.
ProcessIndexes applies only to the OLAP objects, i.e., dimension, cube, measure group, and partition. It requires that the object must already have its "data" built; otherwise, it raises an error. ProcessIndexes preserves the data and rebuilds the "indexes". For dimensions, it builds the bitmap indexes. For partitions, it builds the aggregation data and bitmap indexes. ProcessIndexes is recursively applied to all descendants of the object as well.
ProcessUpdate applies only to dimensions. It is the equivalent of incremental dimension processing in Analysis Services 2000. It sends SQL queries to read the entire dimension table and applies the changes—member updates, additions, deletions.
Since ProcessUpdate reads the entire dimension table, a natural question is how it differs from ProcessFull. The difference is that ProcessUpdate does not discard the dimension storage contents. It applies the changes in a "smart" manner that preserves the fact data in dependent partitions. ProcessFull, on the other hand, does an implicit ProcessClear on all dependent partitions. ProcessUpdate is inherently slower than ProcessFull since it does additional work to apply the changes.
Depending on the nature of the changes in the dimension table, ProcessUpdate can affect dependent partitions. If only new members were added, then the partitions are not affected. But if members were deleted or if member relationships changed (e.g., a Customer moved from Redmond to Seattle), then some of the aggregation data and bitmap indexes on the partitions are dropped. The cube is still available for queries, albeit with lower performance.
ProcessAdd applies only to dimensions and partitions.
ProcessAdd is a new processing option for dimensions that did not exist in Analysis Services 2000. It essentially optimizes ProcessUpdate for the scenario where only new members are added. ProcessAdd never deletes or updates existing members. It only adds new members. The user can restrict the dimension table so that ProcessAdd reads only the new rows.
ProcessAdd for partitions is the equivalent of incremental partition processing in Analysis Services 2000. The user typically specifies an alternate fact table or a filter condition pointing to the new rows. ProcessAdd internally creates a temporary partition, processes it with the specified fact data, and merges it into the target partition.
See the Out of Line Bindings section for details on how to specify the new rows for ProcessAdd.
ProcessStructure applies only to cubes and mining structures.
ProcessStructure for cubes is the equivalent of the Analysis Services 2000 processing option, processBuildStructure, in DSO. It discards the storage contents of the cube and its partitions. It implicitly does a ProcessDefault on all dimensions of the cube and marks the cube as processed. At this point, the cube is available to queries but it will not return any fact data. This is supported mostly for backward-compatibility reasons. It was useful in Analysis Services 2000 for parallel processing utilities. Once the "structure" of a cube is processed, its partitions can be processed in parallel by multiple client sessions without running into locking conflicts.
ProcessStructure on a mining structure discards its storage contents (training data) and rebuilds them. It does not affect the contents of the mining models under the mining structure.
ProcessScriptCache applies only to cubes. The MDX script in a cube can contain CACHE statements. ProcessScriptCache evaluates the script and persists the results for the CACHE statements.
ProcessClearStructureOnly applies only to mining structures. It clears the storage contents (training data) of the mining structure while preserving the contents of its mining models.
Processing an object can produce side effects by affecting the storage contents of other objects. To understand this better, we need to first list the dependencies between objects.
- A data-mining dimension depends on its source-mining model.
- A cube depends on the dimensions that it uses.
- Measure group
- A measure group depends on the dimensions that it uses (typically a subset of the dimensions used by the parent cube).
- A linked measure group depends on its source measure group.
- A partition depends on the same dimensions as its parent measure group.
- Mining structure
- An OLAP mining structure depends on its source cube.
- Mining model
- An OLAP mining model depends on its source cube.
The following table lists the side effects produced by each processing option for each object type. Since ProcessDefault is dynamically converted to another processing option, no side effects are listed for it.
| Object Type | Processing Option | Objects that it depends on | Objects that depend on it |
|---|---|---|---|
| Dimension | ProcessUpdate | | Clear affected indexes and aggregations on partitions |
| Measure Group | ProcessFull | ProcessDefault | ProcessClearStructureOnly / ProcessFull * |
| Measure Group | ProcessClear | | ProcessClearStructureOnly / ProcessClear |
| Measure Group | ProcessData | ProcessDefault | ProcessClearStructureOnly / ProcessFull |
| Partition | ProcessFull | ProcessDefault | ProcessClearStructureOnly / ProcessFull |
| Partition | ProcessClear | | ProcessClearStructureOnly / ProcessFull |
| Partition | ProcessData | ProcessDefault | ProcessClearStructureOnly / ProcessFull |
| Partition | ProcessAdd | | ProcessClearStructureOnly / ProcessFull |
* ProcessClearStructureOnly is applied to OLAP mining structures. ProcessFull or ProcessClear is applied to linked measure groups.
From the above table, we see that when an object is processed, it can have an adverse effect (loss of data, aggregations, indexes, etc.) on other objects that depend on it. The ProcessAffectedObjects attribute on the Batch command tells the server to reprocess these objects using the ProcessDefault option. This is a convenient way to process an object and ensure that other objects are not adversely affected.
Impact analysis is the mechanism by which client applications can retrieve the side effects of a command without actually executing it. It applies only to the Create, Delete, Alter, and Process commands. The client sends the command to the server with the XMLA property ImpactAnalysis set to true. The server analyzes the command and returns a rowset containing the list of side effects. The Impact Analysis dialog (launched from the Process dialog) in SQL Server Management Studio displays the list of affected objects and provides the option to process them in the same request.
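As a sketch of the wire-level request, XMLA request properties such as ImpactAnalysis are passed in the PropertyList of the Execute method that wraps the command (the SOAP envelope is omitted; treat the exact framing as illustrative):

```xml
<Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
   <Command>
      <Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
         <Object>
            <DatabaseID>Adventure Works DW</DatabaseID>
            <DimensionID>Customer</DimensionID>
         </Object>
         <Type>ProcessFull</Type>
      </Process>
   </Command>
   <Properties>
      <PropertyList>
         <!-- Ask the server to report side effects instead of executing the command -->
         <ImpactAnalysis>true</ImpactAnalysis>
      </PropertyList>
   </Properties>
</Execute>
```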
Analysis Services 2005 has a sophisticated processing engine with ample opportunity for parallelism. It is worthwhile to understand how a processing request is handled inside the engine. A processing request can be one of the following:
- a <Process> command
- a <Process> command inside a <Batch>
- a <Parallel> command inside a <Batch>
The processing engine handles a processing request as follows:
- The request is analyzed in conjunction with the objects and their dependencies and an execution plan is produced. The execution plan is essentially a list of jobs and dependencies between the jobs.
- The plan is then executed. A job can be run only if all of its dependent jobs have completed. Jobs that have no dependencies between each other are run in parallel.
For example, when a database is processed, the dimensions are first processed in parallel. Then all cubes are processed in parallel. Within each cube, the measure groups and their partitions are also processed in parallel. This, of course, gets more complicated if the database has linked measure groups and OLAP mining models.
Even within a single object, processing can be subdivided into parallel jobs. Dimensions and partitions are prime examples.
Dimension processing involves the following jobs:
- Attribute processing jobs
- Hierarchy processing jobs
- Bitmap indexing jobs
The following picture shows the dependencies between the jobs. The attribute processing jobs have dependencies based on the relationships between the attributes. The key attribute is always processed last since it has a relationship to every other attribute. The hierarchy processing jobs and bitmap indexing jobs depend on the attribute processing jobs. Once attribute processing is complete, all hierarchies and bitmap indexes are built in parallel.
Partition processing involves the following jobs:
- Fact data processing job
- Bitmap indexing jobs
- Aggregation processing jobs
The following picture shows the dependencies between the jobs. The bitmap indexing jobs and aggregation processing jobs depend on the fact data processing job. Once the fact data is processed, all bitmap indexes and aggregations are built in parallel.
In the previous sections, we have seen several cases where processing can involve parallelism. However, the execution plan does not provide any guarantee for parallelism. Parallelism can be throttled for various reasons including the following:
- CoordinatorExecutionMode: This is a server-wide configuration value (specified in the msmdsrv.ini file) that specifies the maximum number of concurrent jobs in a single processing request. The server administrator can use this to throttle parallelism. The default is zero, i.e., unlimited parallelism.
- MaxParallel: This is an attribute in the <Parallel> command. It specifies the maximum number of concurrent jobs in the request. It does not override CoordinatorExecutionMode. It can only further throttle the parallelism.
- CPUs: Depending on the number of CPUs, the Analysis server can artificially throttle the parallelism to avoid unnecessary context switching.
- Memory Quotas: Every job requests a memory quota that is the minimum and maximum amount of memory it needs. If the memory management system cannot grant it the minimum requested memory, then it blocks the job until the memory is available.
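For illustration, a server administrator could cap concurrent processing jobs at four by editing the CoordinatorExecutionMode setting in msmdsrv.ini. This is a fragment only; the surrounding file contents and the exact placement within the configuration hierarchy are elided and should be verified against the server's own msmdsrv.ini:

```xml
<ConfigurationSettings>
   ...
   <!-- 0 = unlimited parallelism; a positive value caps concurrent processing jobs -->
   <CoordinatorExecutionMode>4</CoordinatorExecutionMode>
   ...
</ConfigurationSettings>
```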
Lazy processing is the mechanism by which the Analysis server automatically builds bitmap indexes and aggregations for dimensions and partitions. Lazy processing is done in the background at a low priority, so it does not interfere with foreground operations such as user queries. The advantage of lazy processing is that it reduces the foreground processing time and the cube is available earlier to users, albeit with lower performance.
Lazy processing is enabled on a dimension, cube, measure group or partition by setting the DDL property ProcessingMode to LazyAggregations. The Analysis server has a background thread that loops through all objects and initiates a ProcessIndexes job on any object that has ProcessingMode=LazyAggregations and needs bitmap indexes or aggregations to be built. The ProcessingPriority DDL property can be used to control the order in which the background thread processes the objects.
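As a hedged sketch, turning on lazy processing for a partition could be scripted with a DDL Alter command along these lines. The ObjectDefinition is abbreviated with "..."; a real Alter must carry the rest of the partition's definition:

```xml
<Alter xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
   <Object>
      <DatabaseID>Adventure Works DW</DatabaseID>
      <CubeID>Adventure Works DW</CubeID>
      <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
      <PartitionID>Internet_Sales_2004</PartitionID>
   </Object>
   <ObjectDefinition>
      <Partition>
         <ID>Internet_Sales_2004</ID>
         ...
         <!-- Build indexes and aggregations lazily in the background -->
         <ProcessingMode>LazyAggregations</ProcessingMode>
      </Partition>
   </ObjectDefinition>
</Alter>
```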
Database administrators can monitor lazy processing using Profiler, just like any other processing operation.
The alternative to lazy processing is to send a ProcessIndexes command explicitly to the server. This approach gives the database administrator control over when the processing is done and the ability to handle errors proactively. The disadvantage is that the processing is done in a foreground job and can thereby interfere with other user requests.
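The explicit command is simply a Process with Type set to ProcessIndexes; for example, against the partition used in the earlier examples:

```xml
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
   <Object>
      <DatabaseID>Adventure Works DW</DatabaseID>
      <CubeID>Adventure Works DW</CubeID>
      <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
      <PartitionID>Internet_Sales_2004</PartitionID>
   </Object>
   <!-- Preserves fact data; rebuilds aggregations and bitmap indexes -->
   <Type>ProcessIndexes</Type>
</Process>
```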
The Bindings, DataSource and DataSourceView elements in the <Process> and <Batch> commands are collectively referred to as "out of line bindings". This is an advanced processing feature that provides the ability to specify or override the bindings of an object at processing time. Note that the object does not persist or remember these bindings after the processing has completed. The bindings need to be re-specified in every processing command as appropriate.
Following are some scenarios where out of line bindings are useful.
ProcessAdd on a partition is the operation in which new fact table rows are added to the partition. It is important to identify the new rows by specifying an alternate fact table or a filter condition. This is done using out of line bindings. The following command (scripted out from the Process dialog) illustrates this.
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
   <Object>
      <DatabaseID>Adventure Works DW</DatabaseID>
      <CubeID>Adventure Works DW</CubeID>
      <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
      <PartitionID>Internet_Sales_2004</PartitionID>
   </Object>
   <Type>ProcessAdd</Type>
   <Bindings>
      <Binding>
         <DatabaseID>Adventure Works DW</DatabaseID>
         <CubeID>Adventure Works DW</CubeID>
         <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
         <PartitionID>Internet_Sales_2004</PartitionID>
         <Source xsi:type="QueryBinding">
            <DataSourceID>Adventure Works DW</DataSourceID>
            <QueryDefinition>SELECT * FROM [dbo].[FactInternetSales] WHERE OrderDateKey > '1280'</QueryDefinition>
         </Source>
      </Binding>
   </Bindings>
</Process>
ProcessAdd on dimensions is similar in concept but more complicated in practice. Unlike partitions, dimensions can be built out of multiple (snowflaked) tables. Hence the Dimension object does not have a single <Source> property that can be overridden in the out of line bindings. The only way to do this is to specify an out of line DataSourceView in which the dimension table is filtered appropriately. The following command illustrates this. Details of the DataSourceView are omitted in the interest of clarity.
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
   <Object>
      <DatabaseID>Adventure Works DW</DatabaseID>
      <DimensionID>Dim Customer</DimensionID>
   </Object>
   <Type>ProcessAdd</Type>
   <DataSourceView>
      <ID>Adventure Works DW</ID>
      <Name>Adventure Works DW</Name>
      <DataSourceID>Adventure Works DW</DataSourceID>
      <Schema>
         ...
         <xs:element name="dbo_DimCustomer"
                     msprop:QueryDefinition="SELECT * FROM DimCustomer WHERE CustomerKey > 29483"
                     msprop:DbTableName="DimCustomer"
                     msprop:IsLogical="True"
                     msprop:TableType="View">
         ...
      </Schema>
   </DataSourceView>
</Process>
Processing is usually done by Analysis Services in pull-mode, i.e., the Analysis server pulls the data out of the relational data source and populates the cube. Push-mode processing is the opposite where the client pushes the data into the Analysis server. Push-mode processing is done by sending a rowset as a parameter to the <Process> command. The rowset itself can be chunked across multiple <Process> requests. Out of line bindings are required for push-mode processing in order to map the columns of the rowset parameter to the object being processed.
The Integration Services Partition Processing Data Flow Item uses chunked push-mode processing to pump data into a partition. The following command (extracted from Profiler during an Integration Services package execution) illustrates push-mode processing. The PushedDataSource references two parameters, InputRowset and EndOfInputRowset. The InputRowset parameter contains the rowset chunk in the XMLA rowset format. The EndOfInputRowset parameter is a boolean that indicates if this is the last chunk.
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
   <Type>ProcessAdd</Type>
   <Object>
      <DatabaseID>Adventure Works DW</DatabaseID>
      <CubeID>Adventure Works DW</CubeID>
      <MeasureGroupID>Fact Internet Sales 1</MeasureGroupID>
      <PartitionID>Internet_Sales_2004</PartitionID>
   </Object>
   <DataSource xsi:type="PushedDataSource">
      <root Parameter="InputRowset"/>
      <EndOfData Parameter="EndOfInputRowset"/>
   </DataSource>
   <Bindings>
      <Binding>
         <DatabaseID>Adventure Works DW</DatabaseID>
         <CubeID>Adventure Works DW</CubeID>
         <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
         <CubeDimensionID>Dim Product</CubeDimensionID>
         <AttributeID>Product Name</AttributeID>
         <KeyColumns>
            <KeyColumn>
               <Source xsi:type="ColumnBinding">
                  <TableID/>
                  <ColumnID>ProductKey</ColumnID>
               </Source>
            </KeyColumn>
         </KeyColumns>
      </Binding>
      ...
      <Binding>
         <DatabaseID>Adventure Works DW</DatabaseID>
         <CubeID>Adventure Works DW</CubeID>
         <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
         <MeasureID>Sales Amount</MeasureID>
         <Source xsi:type="ColumnBinding">
            <TableID/>
            <ColumnID>SalesAmount</ColumnID>
         </Source>
      </Binding>
      ...
   </Bindings>
</Process>
SQL Server Integration Services is a high performance ETL (Extraction, Transformation, and Load) component that can be used for a variety of data transformation and movement tasks. Integration Services can be used for automating and scheduling complex data warehouse maintenance operations.
Processing is a typical maintenance operation in a data warehouse and Integration Services offers two mechanisms for processing Analysis Services objects.
Analysis Services Processing Task
The Analysis Services Processing Task connects to the specified Analysis server and processes one or more objects with the specified processing options. It can be included in an Integration Services package workflow along with other data warehouse maintenance tasks such as data cleansing, transformation, etc.
Analysis Services Data Flow Items
Integration Services offers three Analysis Services Data Flow Items:
- Partition Processing
- Dimension Processing
- Data Mining Model Training
These items can be included in an Integration Services Data Flow task as a destination. The package designer chooses the destination object and maps the columns of the data flow to it. When the task is executed, the data flows through the pipeline and into the object on the Analysis server. This is accomplished using the chunked push-mode processing mechanism in Analysis Services.
The following pictures show the editor for the partition-processing item. The dimension processing and data-mining model training items are similar.
Analysis Services 2005 offers a high performance processing engine with fine-grained controls for parallel processing, batch processing, transaction management, impact analysis, lazy processing, data integrity error handling, etc. It supports the classic pull-mode as well as push-mode processing. And above all, it offers a rich set of processing options that database administrators can use to better manage their data warehouse maintenance tasks.