Chapter 18 - Using Partitions in a SQL Server 2000 Data Warehouse
This chapter discusses the role of partitioning data in the data warehouse. The relational data warehouse and Microsoft® SQL Server™ 2000 Analysis Services cubes both support the partitioning of data. The logical concept behind partitioning is the same in both engines of SQL Server: to horizontally partition data by a key such as date. In the relational database, partitioning is accomplished by creating separate physical tables — for example, one table for each month of data — and defining a union view over the member tables. Similarly, Analysis Services in SQL Server Enterprise Edition supports the explicit partitioning of cubes. In both the relational database and the online analytical processing (OLAP) engine, the complexity of the physical storage is hidden from the analytical user.
The benefits of partitioning a data warehouse are tremendous, substantially reducing query time, improving the load time and maintainability of the databases, and solving the data pruning problem associated with removing old data from the active database. The technique requires building a more complex data staging application than a non-partitioned system. This paper describes best practices for designing, implementing, and maintaining horizontally partitioned data warehouses.
Partitions are strongly recommended for large Analysis Services systems, because an effective partitioning plan will improve query performance substantially. Partitioning the relational data warehouse is not generally recommended, although it can be an effective and well performing solution to some specific warehouse maintenance issues, as discussed below. For a data warehouse that keeps a rolling window of data available (for example, 13 weeks or 37 months), partitions provide a clean mechanism for pruning old data.
Note: The code examples in this chapter are also available on the SQL Server 2000 Resource Kit CD-ROM in the file \Docs\ChapterCode\CH18Code.txt. For more information, see Chapter 39, "Tools, Samples, eBooks, and More."
Using Partitions in a SQL Server 2000 Relational Data Warehouse
A partitioned view joins horizontally partitioned data from a set of members, making the data appear as if from one table. SQL Server 2000 distinguishes between local and distributed partitioned views. In a local partitioned view, all participating tables and the view reside on the same instance of SQL Server. In a distributed partitioned view, at least one of the participating tables resides on a different (remote) server. Distributed partitioned views are not recommended for data warehouse applications.
Dimensional data warehouses are structured around facts and dimensions, and are usually physically instantiated as star schemas, snowflake schemas, and very occasionally as fully denormalized flat tables that combine both facts and dimensions. The discussion in this paper is focused on the use of partitions with a dimensional schema, as these schemas are the most common structure for a relational data warehouse. The recommendations herein are applicable to more general data warehousing schemas.
Advantages of Partitions
Many data warehouse administrators choose to archive aged data after a certain time. For example, a clickstream data warehouse may keep only three to four months of detailed data online. Other common strategies keep 13 months, 37 months, or 10 years online, archiving and removing from the database the old data as it rolls past the active window. This rolling window structure is a common practice with large data warehouses.
Without partitioned tables, the process of removing aged data from the database requires a very large DELETE statement, for example:
DELETE FROM fact_table WHERE date_key < 19990101
This statement is expensive to execute, and is likely to take more time than the load process into the same table. With partitioned tables, by contrast, the administrator redefines the UNION ALL view to exclude the oldest table, and drops that table from the database (presumably after ensuring it has been backed up) – a process that is virtually instantaneous.
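The pruning step can be sketched as follows. The view name sales_fact and the yearly member tables are hypothetical names chosen for illustration; the sketch assumes the view currently covers 1998 through 2000 and that 1998 is rolling out of the active window.

```sql
-- Hypothetical names: view sales_fact over yearly member tables.
-- Step 1: redefine the view so that it no longer includes the oldest table.
ALTER VIEW [dbo].[sales_fact]
AS
SELECT * FROM [dbo].[sales_fact_19990101]
UNION ALL
SELECT * FROM [dbo].[sales_fact_20000101]
GO

-- Step 2: drop the oldest member table,
-- only after confirming that it has been backed up.
DROP TABLE [dbo].[sales_fact_19980101]
GO
```

Neither statement touches the rows of the remaining member tables, which is why the operation completes so quickly compared with a large DELETE.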
As we discuss below, it is expensive to maintain partitioned tables. If data pruning is the only reason to consider partitioning, the designer should investigate a data nibbling approach to removing old data from an unpartitioned table. A script that deletes 1000 rows at a time (using the SET ROWCOUNT 1000 command) could run continuously as a low-priority process until all desired data are removed. This technique is used effectively on large systems, and is a more straightforward approach than building the necessary partition management system. Depending on load volumes and system utilization, this technique will be appropriate for some systems, and should be benchmarked on the system under consideration.
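A minimal sketch of the nibbling approach, reusing the fact_table and date_key names from the DELETE statement above. The one-second pause between batches is an assumption added for illustration and should be tuned, or removed, after benchmarking.

```sql
-- Limit every subsequent data modification statement to 1000 rows.
SET ROWCOUNT 1000

WHILE 1 = 1
BEGIN
    DELETE FROM fact_table WHERE date_key < 19990101

    -- Stop when the last DELETE found nothing left to remove.
    IF @@ROWCOUNT = 0 BREAK

    -- Assumed pause so that other work can proceed between batches.
    WAITFOR DELAY '00:00:01'
END

-- Restore the default behavior (no row limit).
SET ROWCOUNT 0
```

Each small DELETE commits on its own, so locks are held briefly and the transaction log does not have to hold one enormous transaction.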
The fastest way to load data is into an empty table, or a table with no indexes. By loading into smaller partitioned tables the incremental load process can be significantly more efficient.
Once the data warehouse staging application has been built to support partitioning, the entire system becomes easier to maintain. Maintenance activities including loading data, and backing up and restoring tables, can execute in parallel, achieving dramatic increases in performance. The process of incrementally populating downstream cubes can be speeded up and simplified.
Query speed should not be considered a reason to partition the data warehouse relational database. Query performance is similar for partitioned and non-partitioned fact tables. When the partitioned database is properly designed, the relational engine will include in a query plan only the partition(s) necessary to resolve that query. For example, if the database is partitioned by month and a query is conditioned on Jan-2000, the query plan will include only the partition for Jan-2000. The resulting query will perform well against the partitioned table, about the same as against a properly indexed combined table with a clustered index on the partitioning key.
Disadvantages of Partitions
The primary disadvantage of partitions is the requirement that the administrator build an application to manage the partitions. It would be inappropriate to move into production a data warehouse that uses horizontal partitions in the relational database, without first designing, testing, and rolling out an application to manage those partitions. One of the goals of this paper is to discuss the issues and design decisions underlying the partition management application.
Query Design Constraints
For the best query performance, all queries must place conditions on the filter key directly in the fact table. A query that places the constraint on a second table, such as a Dates dimension table, will include all partitions in the query.
The administrator of a dimensional data warehouse typically partitions only the fact tables; there would seldom be an advantage to partitioning a dimension table. In some circumstances, a very large dimension table containing more than 10 million members may benefit from partitioning. A non-dimensional relational data warehouse can also be partitioned, and the general remarks in this paper still apply.
An effective partitioning plan is developed in the context of the system architecture and design goals. Even with identical schema designs, a relational data warehouse that exists only to populate Analysis Services cubes may imply a different partitioning structure than one queried directly by analysts. A system with a rolling window will necessarily be partitioned by time; others may not.
If the data warehouse includes Analysis Services cubes, Microsoft recommends that the partitions in the relational data warehouse and Analysis Services databases be structured in parallel. The maintenance application is simplified: the application creates a new cube partition at the same time as it creates a new table in the relational database. Too, administrators need learn only one partitioning strategy. This is merely a simplifying recommendation, however. An application may have a compelling reason to partition the two databases differently, and the only downside would be the complexity of the maintenance application.
Overview of Partition Design
Partitioned tables in the SQL Server database can use updatable or queryable (nonupdatable) partitioned views. In both cases, the table partitions are created with CHECK constraints that ensure each partition contains the correct data. An updatable partitioned view will support an INSERT (or UPDATE or DELETE) on the view, and push the operation to the correct underlying table. While this is a nice benefit, a data warehouse application typically needs to bulk load, which cannot be performed through the view. The table below summarizes the requirements, advantages, and disadvantages of updatable and queryable partitioned views.
Updatable partitioned view
· Requirement: partition key(s) enforced by CHECK constraint(s).
· Advantage (query performance): query plans include only those member tables necessary to resolve the query.
· Disadvantage (load performance): data loading through the view occurs too slowly for this approach to be viable for most data warehousing applications.
Queryable partitioned view
· Requirement: partition key(s) enforced by CHECK constraint(s).
· Advantage (query performance): query plans include only those member tables necessary to resolve the query.
· Disadvantage: view is limited to 256 member tables.
Microsoft's recommended practice is to design the fact table as a local (on a single server) partitioned union view with the primary key defined. In most cases this definition will result in the partitioned view also being updatable, but the data warehouse maintenance application should be designed to bulk load most data directly into the member tables rather than through the view.
The following code sample illustrates the syntax for defining the member tables and the union view, and for inserting data into the view:
-- Create the fact table for 1999
CREATE TABLE [dbo].[sales_fact_19990101] (
    [date_key] [int] NOT NULL
        CHECK ([date_key] BETWEEN 19990101 AND 19991231),
    [product_key] [int] NOT NULL,
    [customer_key] [int] NOT NULL,
    [promotion_key] [int] NOT NULL,
    [store_key] [int] NOT NULL,
    [store_sales] [money] NULL,
    [store_cost] [money] NULL,
    [unit_sales] [float] NULL
)
ALTER TABLE [sales_fact_19990101]
    ADD PRIMARY KEY (
        [date_key], [product_key], [customer_key], [promotion_key], [store_key])
GO

-- Create the fact table for 2000
CREATE TABLE [dbo].[sales_fact_20000101] (
    [date_key] [int] NOT NULL
        CHECK ([date_key] BETWEEN 20000101 AND 20001231),
    [product_key] [int] NOT NULL,
    [customer_key] [int] NOT NULL,
    [promotion_key] [int] NOT NULL,
    [store_key] [int] NOT NULL,
    [store_sales] [money] NULL,
    [store_cost] [money] NULL,
    [unit_sales] [float] NULL
)
ALTER TABLE [sales_fact_20000101]
    ADD PRIMARY KEY (
        [date_key], [product_key], [customer_key], [promotion_key], [store_key])
GO

-- Create the UNION ALL view.
-- CREATE VIEW must be the first statement in its batch, hence the GO above.
CREATE VIEW [dbo].[sales_fact]
AS
SELECT * FROM [dbo].[sales_fact_19990101]
UNION ALL
SELECT * FROM [dbo].[sales_fact_20000101]
GO

-- Now insert a few rows of data through the view, for example:
INSERT INTO [sales_fact] VALUES (19990125, 347, 8901, 0, 13, 5.3100, 1.8585, 3.0)
INSERT INTO [sales_fact] VALUES (19990324, 576, 7203, 0, 13, 2.1000, 0.9450, 3.0)
INSERT INTO [sales_fact] VALUES (19990604, 139, 7203, 0, 13, 5.3700, 2.2017, 3.0)
INSERT INTO [sales_fact] VALUES (20000914, 396, 8814, 0, 13, 6.4800, 2.0736, 2.0)
INSERT INTO [sales_fact] VALUES (20001113, 260, 8269, 0, 13, 5.5200, 2.4840, 3.0)
GO
To verify that the partitioning is working correctly, use SQL Query Analyzer to show the query plan for a query such as:
SELECT TOP 2 * FROM sales_fact WHERE date_key = 19990324
You should see only the 1999 table included in the query plan. Compare this query plan with that generated by the same two tables with the primary key removed: the 2000 table is still excluded. Contrast these plans with the query plan generated against the schema with the constraint on date_key removed. With the constraint removed, both the 1999 and the 2000 tables are included in the query.
Note that in general, it is good practice to use the "TOP N" syntax when doing exploratory queries against large tables, as it returns results quickly with minimal server resources. When looking at partitioned table query plans, it's even more important, because the query plan generated by a "SELECT *" statement is difficult to parse. To the casual observer, it looks as if the query plan includes all the component tables of the UNION ALL view, although at query execution time only the appropriate tables are used in the query.
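As an alternative to the graphical plan display, the plan can be returned as text. The sketch below uses the SET SHOWPLAN_TEXT option against the sales_fact view defined earlier; while the option is on, the query is compiled but not executed.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON
GO

-- The plan for this query is returned instead of its result set.
SELECT TOP 2 * FROM sales_fact WHERE date_key = 19990324
GO

SET SHOWPLAN_TEXT OFF
GO
```

Look for the member table names in the plan text to confirm which partitions the query would touch.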
Apply Conditions Directly to the Fact Table
For the best query performance, all queries must place conditions on the filter key directly in the fact table. A query that places the constraint on a second table, such as a Dates dimension table, will include all partitions in the query. Standard star join queries into a UNION ALL fact table work well:
· Place conditions on the non-partitioned dimensions in those dimension tables, as is standard practice.
· Include attributes from the partitioning dimension (Dates).
· Design queries against a partitioned dimensional schema exactly as you would against a non-partitioned schema, with the exception that conditions on dates are most effective when placed directly on the date key in the fact table.
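A sketch of a star join query written this way. The dimension tables dates_dim and product_dim and their columns (month_name, product_category) are hypothetical; only sales_fact and its key columns come from the earlier sample.

```sql
SELECT d.month_name,
       p.product_category,
       SUM(f.store_sales) AS total_sales
FROM sales_fact f
JOIN dates_dim d ON d.date_key = f.date_key
JOIN product_dim p ON p.product_key = f.product_key
WHERE f.date_key BETWEEN 19990101 AND 19990331   -- date condition on the fact table key
  AND p.product_category = 'Beverages'           -- other conditions on the dimension tables
GROUP BY d.month_name, p.product_category
```

Date attributes such as month_name still come from the Dates dimension. Only the filter moves to the fact table. Expressing the same date range as a condition on a dates_dim column instead would, as described above, bring all partitions into the query.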
If each partition table has a clustered index with date as the first column in the index, the cost of going to all partitions to resolve an ad hoc query is relatively small. Predefined queries, such as those that generate standard reports or that incrementally update downstream databases, should be written as efficiently as possible.
Choice of Partition Key(s)
The fact table can be partitioned on multiple dimensions, but most practitioners will partition only date. As described previously, date partitioning enables easy "rolling window" management, and older partitions may even be located in a different place, or more lightly indexed, than fresher partitions. Too, most queries into the data warehouse filter on date.
For applications partitioned by date, the decision variables are:
How much data to keep online? This decision should be driven largely from the business requirements, tempered by the cost-benefit ratio of keeping very large volumes of data online.
What should the date key look like? It is a widely accepted data warehousing best practice to use surrogate keys for the dimension and fact tables. For fact tables partitioned by date, the recommended practice is to use "smart" integer surrogate keys of the form "yyyymmdd". As an integer, this key will use only 4 bytes, compared to the 8 bytes of a datetime key. Many data warehouses use a natural date key of type datetime.
How granular should the partitions be? Although the example above uses annual partitions, most systems will choose a finer granularity such as month, week, or day. While it's mildly interesting to consider whether user queries tend to fall along month or week boundaries, by far the most important factor is the overall size and manageability of the system. Recall that any SQL query can reference at most 256 tables; for data warehouses that maintain more than a few months of data, a UNION ALL view over daily partitions will hit that limit. As a rule of thumb, a fact table that is partitioned only on date would most likely partition by week.
How are the partition ranges defined? The BETWEEN syntax is most straightforward and human-readable, and performs efficiently. For example consider monthly partitions of the form:
date_key < 19990101
date_key BETWEEN 19990101 AND 19990131
date_key BETWEEN 19990201 AND 19990229
…
date_key BETWEEN 19991201 AND 19991231
date_key > 19991231
Note the first and last partitions above: it is a good practice to define these partitions even if you expect never to put data into them, in order to cover the universe of all possible date values. Note also that the February partition covers data through Feb-29, although 1999 is not a leap year. This structure removes the need to include leap year logic in the design of the application that creates partitions and constraints.
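These ranges could be attached to member tables as CHECK constraints along the following lines. All table and constraint names here are hypothetical, including the names of the two catch-all tables, and the tables are assumed to exist already with a date_key column.

```sql
-- Catch-all partition for anything earlier than the first month.
ALTER TABLE [dbo].[sales_fact_before_19990101]
    ADD CONSTRAINT [ck_sales_fact_before_19990101]
    CHECK ([date_key] < 19990101)

-- January 1999.
ALTER TABLE [dbo].[sales_fact_19990101m]
    ADD CONSTRAINT [ck_sales_fact_19990101m]
    CHECK ([date_key] BETWEEN 19990101 AND 19990131)

-- February always runs through the 29th, so no leap year logic is needed.
ALTER TABLE [dbo].[sales_fact_19990201m]
    ADD CONSTRAINT [ck_sales_fact_19990201m]
    CHECK ([date_key] BETWEEN 19990201 AND 19990229)

-- Catch-all partition for anything later than the last month.
ALTER TABLE [dbo].[sales_fact_after_19991231]
    ADD CONSTRAINT [ck_sales_fact_after_19991231]
    CHECK ([date_key] > 19991231)
```

The ranges must not overlap, and together they cover every possible integer value of date_key.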
Are partitions merged together over time? In order to minimize the number of active partitions, the database administrator may choose to build the partitioning application in such a way that daily partitions are merged into weeks or months. This approach is discussed in greater detail below, in the section on populating and maintaining partitions.
This detailed discussion of how to partition by date should illuminate the discussion of other prospective partition keys.
Data loading: If there is a strong tendency for incoming data to align by another dimension — for example, if data for each Store or Subsidiary are delivered by different systems — these are natural partition keys.
Data querying of cubes: Although there is no technical reason for the relational database and Analysis Services cubes to be partitioned in the same way, it is common practice to do so. The maintenance application is simpler if this assumption is made. Thus, even if the relational database exists only to populate Analysis Services cubes, consideration should be given to common query patterns when choosing partition keys.
The conventions for naming the member tables of a horizontally partitioned fact table should flow naturally from the partition design. For greatest generality, use the full partition start date in the title: [sales_fact_yyyymmdd] is preferred to [sales_fact_yyyy], even if the partitioning is annual.
If the database supports partitions at multiple granularities, the naming convention should reflect the time span held within each partition. For example, use sales_fact_20001101m for a monthly partition, and sales_fact_20001101d for a daily one.
The names of the member tables are hidden from end users, who access data through the view, so the member table names should be oriented to the maintenance application.
Partitioning for Downstream Cubes
If the only use of the relational data warehouse is to support Analysis Services cubes, it is not necessary to define the UNION ALL view. In this case, the 256-table limit would not apply to the application, but it is not a recommended practice to partition the relational data warehouse in such a way that a UNION ALL view could not be defined.
Managing the Partitioned Fact Table
The partitioned data warehouse should not be moved into production until the management of the partitions has been automated and tested. The partition management system is a simple application, and the general requirements of that system are outlined here.
The discussion below assumes that the partitioning will occur along date.
A robust partition management system is driven by meta data. That meta data can be stored anywhere, as long as the meta data is accessible programmatically. Most data warehouse systems use custom meta data tables defined on the data warehouse SQL Server, or SQL Server 2000 Meta Data Services.
Whatever the meta data storage mechanism, the contents of the meta data must include the following information on each partition:
Date partition created
Date ranges of data in partition
Date partition moved online (included in UNION ALL view)
Date partition moved offline (dropped from view)
Date partition dropped
Other meta data tables that are part of the data warehouse's overall management system should track when and how much data are loaded into each partition.
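One possible shape for a custom meta data table that holds the items listed above. The table name partition_meta and all of its column names are assumptions made for this sketch.

```sql
-- Hypothetical meta data table: one row per member table.
CREATE TABLE [dbo].[partition_meta] (
    [table_name]   [varchar](128) NOT NULL PRIMARY KEY,
    [date_created] [datetime] NOT NULL DEFAULT (GETDATE()),
    [range_start]  [int] NOT NULL,      -- first date_key in the partition (yyyymmdd)
    [range_end]    [int] NOT NULL,      -- last date_key in the partition (yyyymmdd)
    [date_online]  [datetime] NULL,     -- when the table was added to the UNION ALL view
    [date_offline] [datetime] NULL,     -- when the table was removed from the view
    [date_dropped] [datetime] NULL      -- when the table was dropped
)
GO

-- List the partitions that should currently appear in the view.
SELECT [table_name], [range_start], [range_end]
FROM [dbo].[partition_meta]
WHERE [date_online] IS NOT NULL
  AND [date_offline] IS NULL
ORDER BY [range_start]
```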
Creating New Partitions
The first task of the partition management system is to create new partitions. A job should be scheduled to run periodically, to create a new table that will serve as the next partition.
There are many effective ways to perform this task. The recommended approach is to use SQL-DMO (SQL Distributed Management Objects) to create a new table with the same structure and indexes as the existing partitions, but with a new table name, index names, partition key constraint definitions, filegroups, and so on:
Get the template table definition (usually the most recent partition).
Modify the table and index Name properties, check constraint Text property, and other properties.
Use the ADD method to instantiate the table.
With intelligent naming conventions, the task can be accomplished with few lines of code.
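The same idea can also be sketched in Transact-SQL with dynamic SQL. This is an alternative to the SQL-DMO approach recommended above, not an illustration of it. The column list is copied from the sales_fact sample earlier in this chapter, and the 2001 date range is an assumed example.

```sql
DECLARE @start_text varchar(8), @end_text varchar(8)
DECLARE @name varchar(128), @sql varchar(8000)

-- Assumed range for the next annual partition.
SET @start_text = '20010101'
SET @end_text   = '20011231'

-- The naming convention drives the table name: sales_fact_yyyymmdd.
SET @name = 'sales_fact_' + @start_text

SET @sql = 'CREATE TABLE [dbo].[' + @name + '] ('
    + '[date_key] [int] NOT NULL CHECK ([date_key] BETWEEN '
    + @start_text + ' AND ' + @end_text + '), '
    + '[product_key] [int] NOT NULL, '
    + '[customer_key] [int] NOT NULL, '
    + '[promotion_key] [int] NOT NULL, '
    + '[store_key] [int] NOT NULL, '
    + '[store_sales] [money] NULL, '
    + '[store_cost] [money] NULL, '
    + '[unit_sales] [float] NULL, '
    + 'PRIMARY KEY ([date_key], [product_key], [customer_key], '
    + '[promotion_key], [store_key]))'

-- Create the new member table.
EXEC (@sql)
```

A production version would derive the date range from the schedule and record the new table in the partition meta data.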
As discussed later in this chapter, your application may use Analysis Services partitions for the data warehouse system's cubes. If so, the script or program that creates the partition tables in the RDBMS can go on to create the corresponding cube partition, using Decision Support Objects (DSO).
Populating the Partitions
As described above, data can be loaded into a UNION ALL view. In theory this is a great feature of the table partitioning structure, but in practice it is not recommended for data warehouse applications. Data loads into the UNION ALL view cannot be performed in bulk; the load process will be too slow for a data warehouse that is large enough to warrant partitioned tables.
Instead, the data warehouse loading application must be designed to fast load data for each period into the appropriate target table. If the data staging application is implemented in Data Transformation Services (DTS), the Dynamic Properties task can easily change the name of the target table of the Transform Data task or the Bulk Insert task.
As long as the new partition does not yet participate in the UNION ALL view, the data load requires no system downtime.
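A sketch of a bulk load directly into a member table that is not yet part of the view. The target table name, the file path, and the tab-delimited file layout are all assumptions.

```sql
-- Load one period's data straight into its member table.
BULK INSERT [dbo].[sales_fact_20010101]
FROM 'C:\staging\sales_fact_2001.txt'    -- hypothetical staging file
WITH (
    FIELDTERMINATOR = '\t',              -- assumed: tab-delimited columns
    ROWTERMINATOR = '\n',                -- assumed: one row per line
    TABLOCK                              -- table-level lock for the duration of the load
)
```

Because no query can reach this table through the view yet, the table-level lock does not block users.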
The data warehouse staging application should be designed to handle incoming data that does not belong in the current partition. This case could occur as an exception to normal events, if the data warehouse loading process did not occur one night. Other systems are faced with newly arrived old data on an ongoing basis. The system's design must consider the likelihood, frequency, and data volumes of these exceptions.
If old data arrives in sufficiently low volume, the simplest design would use the updatable UNION ALL view to load all data that doesn't belong to the current partition.
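A sketch of this design, assuming a hypothetical staging table sales_fact_staging with the same columns as the member tables, and assuming that the current partition starts at 20000101.

```sql
-- Route late-arriving rows through the updatable view;
-- SQL Server directs each row to the member table whose
-- CHECK constraint matches its date_key.
INSERT INTO [dbo].[sales_fact]
    ([date_key], [product_key], [customer_key], [promotion_key],
     [store_key], [store_sales], [store_cost], [unit_sales])
SELECT [date_key], [product_key], [customer_key], [promotion_key],
       [store_key], [store_sales], [store_cost], [unit_sales]
FROM [dbo].[sales_fact_staging]
WHERE [date_key] < 20000101      -- rows that predate the current partition
```

Rows for the current partition would still be bulk loaded directly into its member table.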
Defining the UNION ALL View
Once the incremental load has successfully finished, the UNION ALL view must be revised. SQL-DMO is again the recommended approach for this task: use the ALTER method to change the TEXT property of the VIEW object. The list of partitions to include in the view definition is best derived from the meta data table described above.
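The view text can also be generated in Transact-SQL. The sketch below is an alternative to the SQL-DMO approach, and it assumes a hypothetical meta data table partition_meta with columns table_name, range_start, date_online, and date_offline. It also assumes the view sales_fact already exists and that the load process has set date_online for the new partition.

```sql
DECLARE @sql varchar(8000), @name varchar(128)
SET @sql = ''

-- Walk the online partitions in date order.
DECLARE part_cursor CURSOR FOR
    SELECT [table_name]
    FROM [dbo].[partition_meta]          -- hypothetical meta data table
    WHERE [date_online] IS NOT NULL
      AND [date_offline] IS NULL
    ORDER BY [range_start]

OPEN part_cursor
FETCH NEXT FROM part_cursor INTO @name

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Separate member SELECTs with UNION ALL.
    IF @sql <> '' SET @sql = @sql + ' UNION ALL '
    SET @sql = @sql + 'SELECT * FROM [dbo].[' + @name + ']'
    FETCH NEXT FROM part_cursor INTO @name
END

CLOSE part_cursor
DEALLOCATE part_cursor

-- Redefine the view only if at least one partition is online.
IF @sql <> ''
    EXEC ('ALTER VIEW [dbo].[sales_fact] AS ' + @sql)
```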
There are potential benefits in merging several partitions into a single larger partition. A data warehouse with large daily load volumes and a small load window may find significant gains in load performance by:
Creating a text file with the data to be loaded, sorted in the order of the clustered index.
Bulk-loading into empty daily partitions.
Creating all nonclustered indexes.
Bringing the new partition online by recreating the UNION ALL view.
Weekly, creating and populating a new weekly partition by inserting from the daily partitions, rebuilding indexes, and recreating the UNION ALL view. The daily partitions could then be dropped.
By moving to weekly or even monthly partitions as the data ages, more partitions can be kept online within the UNION ALL view.
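The weekly merge step might look like the sketch below. All table names are hypothetical: the d suffix follows the naming convention described earlier, the w suffix for a weekly partition is an assumption, and the view is assumed to contain one older monthly partition plus the week being merged.

```sql
-- 1. Create the weekly partition with a CHECK constraint spanning the week.
CREATE TABLE [dbo].[sales_fact_20001106w] (
    [date_key] [int] NOT NULL
        CHECK ([date_key] BETWEEN 20001106 AND 20001112),
    [product_key] [int] NOT NULL,
    [customer_key] [int] NOT NULL,
    [promotion_key] [int] NOT NULL,
    [store_key] [int] NOT NULL,
    [store_sales] [money] NULL,
    [store_cost] [money] NULL,
    [unit_sales] [float] NULL
)
GO

-- 2. Populate it from the seven daily partitions.
INSERT INTO [dbo].[sales_fact_20001106w]
SELECT * FROM [dbo].[sales_fact_20001106d] UNION ALL
SELECT * FROM [dbo].[sales_fact_20001107d] UNION ALL
SELECT * FROM [dbo].[sales_fact_20001108d] UNION ALL
SELECT * FROM [dbo].[sales_fact_20001109d] UNION ALL
SELECT * FROM [dbo].[sales_fact_20001110d] UNION ALL
SELECT * FROM [dbo].[sales_fact_20001111d] UNION ALL
SELECT * FROM [dbo].[sales_fact_20001112d]
GO

-- 3. Build the index after the load, not before.
ALTER TABLE [dbo].[sales_fact_20001106w]
    ADD PRIMARY KEY (
        [date_key], [product_key], [customer_key], [promotion_key], [store_key])
GO

-- 4. Swap the weekly table into the view in place of the daily tables.
ALTER VIEW [dbo].[sales_fact]
AS
SELECT * FROM [dbo].[sales_fact_20001001m]
UNION ALL
SELECT * FROM [dbo].[sales_fact_20001106w]
GO

-- 5. Drop the daily partitions that are no longer referenced.
DROP TABLE [dbo].[sales_fact_20001106d]
DROP TABLE [dbo].[sales_fact_20001107d]
DROP TABLE [dbo].[sales_fact_20001108d]
DROP TABLE [dbo].[sales_fact_20001109d]
DROP TABLE [dbo].[sales_fact_20001110d]
DROP TABLE [dbo].[sales_fact_20001111d]
DROP TABLE [dbo].[sales_fact_20001112d]
GO
```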
Using Partitions in SQL Server 2000 Analysis Services
Analysis Services in SQL Server 2000 Enterprise Edition explicitly supports partitioned cubes that are analogous to partitioned tables in the relational database. For a cube of moderate to large size, partitions can greatly improve query performance, load performance, and ease of cube maintenance. Partitions can be designed along one or more dimensions, but cubes are often partitioned only along the Dates dimension. The incremental loading of a partitioned cube, including the creation of new partitions, should be performed by a custom application.
Partitions can be stored locally or distributed across multiple physical servers. Although very large systems can also benefit from distributing partitions among multiple servers, our tests indicate that distributed partition solutions provide the most benefit when cubes are in the multiterabyte size range. The current paper considers only locally partitioned cubes.
Advantages of Partitions
The performance of queries is substantially improved by partitioning the cube. Even moderate-sized cubes, based on just 100 gigabytes (GB) of data from the relational database, will benefit from partitioning. The benefits of cube partitioning are particularly noticeable under multiuser loads.
The query performance improvement that each application will see varies by the structure of the cube, the usage patterns, and the partition design. A query that requests only one month of data from a cube partitioned by month will access only one partition. In general, we expect that moving from a large cube in a single partition to a well-designed local partitioning strategy will result in an average query performance improvement of 100 percent to 1,000 percent.
Pruning Old Data
As with the relational data warehouse, the Analysis Services system administrator may choose to keep only recent data in a cube. With a single partition, the only way to purge the old data is to re-process the cube. By partitioning along the Dates dimension, the administrator can drop old partitions without system downtime.
From an administrative point of view, partitions are independent data units that can be added and dropped without impacting other partitions. This helps with managing the lifecycle of data in the system. Each cube partition is stored in a separate set of files. Backup and restore operations of these data files are easier to manage with the smaller partition files. This is especially true if each partition file is under 2 GB in size. In this case the Archive and Restore utility will be effective. If a portion of the cube is corrupted or is discovered to contain incorrect or inconsistent data, that partition can be reprocessed much more quickly than the entire cube. In addition, it is possible to change the storage mode and aggregation design of older partitions to save space.
Different partitions can use different data sources. A single cube can combine data from multiple relational databases. For example, a corporate data warehouse may be constructed so that data from Europe and North America are hosted on different servers. If the cube is partitioned by geography, it can logically combine these disparate data sources. The relational schema must be virtually identical on the source servers for the single cube definition to work properly.
A partitioned cube can be loaded much more quickly than a non-partitioned cube, because multiple partitions can be loaded in parallel. As we discuss below, you must acquire a third-party tool, or build a simple custom tool, in order to process partitions in parallel. On a multiprocessor computer, the performance benefits are significant. The parallel-processing tool should aim for 90 percent CPU utilization. This performance is typically achieved by simultaneously processing between one and two partitions for every two processors. For example, on a four-processor computer with all processors devoted to processing the cube, you will want to process between two and four partitions simultaneously. If you try to process more partitions than you have processors, performance will degrade significantly. One partition for each two processors is conservative; the ideal number depends on speed of data flow from the source databases, aggregation design, storage, and other factors.
Under some circumstances, it is more efficient to rebuild a partition than to incrementally process the partition. Of course, this is far less likely to be the case if the entire cube is held in a single partition.
Disadvantages of Partitions
The primary disadvantage of partitions is the requirement that the administrator build an application to manage the partitions. It would be inappropriate to move a partitioned cube into production, without first designing, testing, and rolling out an application to manage those partitions. One of the goals of this chapter is to discuss the issues and design decisions underlying the partition management application.
Meta Data Operations
As the number of partitions increases, the performance of meta data operations, such as browsing the cube definition, declines. This is a burden for the administrator rather than the end user, but an excessively partitioned cube will be difficult to administer.
Overview of Partitions
An effective partition plan balances multiple considerations:
Number of partitions: Analysis Services imposes no practical limits on the number of partitions in a cube, but a cube with several thousand partitions will be challenging to manage. In addition, there is a point at which the cost of combining result sets from multiple partitions outweighs the query performance benefits of partition selectivity. It is difficult to provide a rule of thumb for where this point might be, as it depends on cube design, query patterns, and hardware, but it's probably safe to have one partition for every gigabyte of cube storage, or for each ten million rows of fact data. In other words, a 100-GB cube (alternatively, 1 billion facts) on hardware appropriate for that data volume should easily support 100 partitions. If the partition design calls for significantly more partitions than that, test the performance of alternative partition plans.
Load and maintenance: Data may naturally flow into the cube along certain dimensions such as time. In order to support the staging application to populate and incrementally refresh the cube, these dimensions may be natural partition slices. The Dates dimension, for example, is usually the first partition dimension. Other applications may receive data segmented by geographic region, customer segment, and so on. Because different partitions can use different data sources, the cube population program can efficiently load data from a distributed data warehouse or other source system.
Query performance: An effective partition design relies on some knowledge of common user query patterns. An ideal partitioning dimension is very selective for most detailed user queries. For example, partitioning along Date often improves query performance, as many queries are focused on details in the most recent time periods. Similarly, perhaps many users focus queries along geographic or organizational lines. For maximum query performance improvement, you want queries touching as few partitions as possible.
It is easier to manage partitions along dimensions that are static or, like Date, change in a predictable way. For example, a partition along the "States in the US" is relatively static, as the application designers could expect to receive plenty of warning of a 51st state. By contrast, partitions along the Product dimension are likely to change over time, as new products are added relatively frequently. It may still be desirable to partition along a dynamic dimension, but the designer should note that the administrative system must necessarily be more complex. If a dimension is marked as "changing", partitioning along that dimension is not permitted. In any case, it is recommended that you create an "all other" partition to hold data for unexpected dimension members.
Slices and Filters
Just as with relational partitions, Analysis Services partitions rely on the administrator to define the data to be found in each partition. The RDBMS uses the CHECK constraint to perform this function; Analysis Services uses the slice. A slice is set to a single member in a dimension, such as a year member of the [Dates] dimension, or a quarter member such as [Q1] beneath a year. In the Analysis Manager Partition Wizard, the slice is set in the screen titled "Select the data slice (optional)". In DSO, the slice is accessed and set using the SliceValue property of the partition's dimension level object. Sample syntax is provided later in this document.
The definition of each partition also includes information about what source data flow into this partition. The partition meta data stores the information necessary to populate the partition. The administrator can set the data source and the fact table with the Partition Wizard, or programmatically with DSO. At the time a partition is processed, the settings of its SliceValue property are automatically transformed into a filter against the source. The partition definition optionally includes an additional filter, the SourceTableFilter property, that can be used to refine the query that will populate the partition. At the time the partition is processed, the WHERE clause of the query issued against the source data will include both the default conditions based on the slice definition, and any additional filter(s) defined by the SourceTableFilter property.
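The way the two kinds of conditions combine can be sketched in a few lines of Python. This is only an illustration of the idea: the function name is hypothetical, the column names are borrowed from the FoodMart sample for readability, and the SQL text that Analysis Services actually generates will differ.

```python
def build_where_clause(slice_conditions, source_table_filter=None):
    """Combine slice-derived conditions with an optional extra filter.

    slice_conditions: SQL predicates assumed to be derived from the
        partition's SliceValue settings (illustrative strings only).
    source_table_filter: the additional predicate, playing the role of
        the SourceTableFilter property.
    """
    parts = [f"({condition})" for condition in slice_conditions]
    if source_table_filter:
        parts.append(f"({source_table_filter})")
    # Every condition must hold for a row to flow into the partition.
    return " AND ".join(parts)


# Hypothetical predicates for a December 1998 partition.
slice_conditions = [
    '"time_by_day"."the_year" = 1998',
    '"time_by_day"."month_of_year" = 12',
]
where = build_where_clause(slice_conditions, '"sales_fact_dec_1998"."store_id" < 10')
print("WHERE " + where)
```

Because the conditions are joined with AND, a filter that contradicts the slice silently yields an empty partition rather than an error.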
Slices and filters must both be properly defined in order for the partitions to work correctly. The role of the Slice is to improve query performance. The Analysis Services engine uses the information in the partition Slice definition to direct a query only to the partition(s) that contain the underlying data. Queries will resolve accurately on a partitioned cube without defined partition slices, but query performance will not be optimized because each query must examine all partitions in the absence of slice definitions.
The role of the filter and source meta data is to define the data that flow into the partition. These elements must be correctly defined, or the overall cube will have incorrect data. When a partition is processed, Analysis Services constrains the data stored in the cube to match the Slice. But no checks are performed to ensure the data are not also loaded into another partition.
For example, imagine that you've partitioned a cube by year, and you incorrectly set the Slice for the 1998 partition to [Dates].[Year].[1997], but constrained the filter to 1998. The partition, when processed, would contain zero rows, which is probably not the desired result. By contrast, if you had an existing partition for 1998 and added a new partition for Dec-1998, it would be easy to load the Dec-1998 data twice, and you would receive no notification from Analysis Services that this event had occurred.
It is not difficult to keep partition slices and filters aligned, but it is imperative that the partition management system designer be aware of the issues.
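Because Analysis Services does not report double loading, the partition management system can run its own check before processing. The following Python sketch assumes that each partition's filter can be summarized as a half-open date range; the partition names and the tuple layout are hypothetical.

```python
from datetime import date


def find_overlaps(partitions):
    """Return pairs of partition names whose [start, end) ranges overlap.

    partitions: list of (name, start_date, end_date) tuples, where
    end_date is exclusive.
    """
    ordered = sorted(partitions, key=lambda p: p[1])
    overlaps = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start date: no later range can overlap
            overlaps.append((name_a, name_b))
    return overlaps


partitions = [
    ("Sales_1998", date(1998, 1, 1), date(1999, 1, 1)),
    ("Sales_Dec1998", date(1998, 12, 1), date(1999, 1, 1)),
    ("Sales_1999", date(1999, 1, 1), date(2000, 1, 1)),
]
print(find_overlaps(partitions))  # [('Sales_1998', 'Sales_Dec1998')]
```

The reported pair is exactly the Dec-1998 situation described above: the yearly partition's filter must be narrowed before the monthly partition is processed.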
Advanced Slices and Filters
Most partition strategies identify a dimension level to partition, and put the data for each member of that dimension in its own partition. For example, "partition by year" or "partition by state".
It is also common to define a partition plan that drills down on one part of the cube. For example, recent data may be partitioned by day or week; older data by month or year.
Depending on usage patterns and data cardinality, it may be desirable to design a more complex partition plan. For example, imagine that 80 percent of customers live in California, 10 percent in Oregon, and the remaining 10 percent are distributed evenly across the rest of the country. Further, most analysis is focused on local customers (California). In this case, the administrator may wish to create county-level partitions for California, a state-level partition for Oregon, and one partition for the rest of the United States.
The slices would be something like:
California counties: [All USA].[CA].[Amador] … [All USA].[CA].[Yolo]
Oregon state: [All USA].[OR]
Rest of the country: [All USA]
As discussed above, source data filters would have to be correctly defined to ensure that these partitions are populated correctly. Note that a query that needs to combine data from California and Oregon would also have to look at the "Rest of the country" partition. While it is not very expensive for Analysis Services to look at the map of the "Rest of the country" to learn there is no relevant data therein, query performance would have been better if the cube were partitioned uniformly by state with drilldown on CA. The application logic required to maintain uneven partitions is also more complex, and in general this partitioning approach is not recommended. However, with appropriate care in the design of the maintenance application, and understanding of the query performance tradeoffs, the technique may solve specific design problems.
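The cost of the uneven plan can be illustrated with a small Python model. It is a deliberately simplified sketch, not the algorithm Analysis Services uses: each slice is represented as a member path, and a partition is assumed to be examined whenever its slice and the queried member lie on the same branch of the hierarchy. The partition names are hypothetical.

```python
def partitions_to_scan(query_member, slices):
    """Return the partitions that may hold data for query_member.

    A partition is a candidate when its slice path is a prefix of the
    queried member path, or the queried member is a prefix of the slice.
    """
    def on_same_branch(a, b):
        n = min(len(a), len(b))
        return a[:n] == b[:n]

    return [name for name, path in slices.items()
            if on_same_branch(path, query_member)]


slices = {
    "CA_Amador": ("All USA", "CA", "Amador"),
    "CA_Yolo": ("All USA", "CA", "Yolo"),
    "OR": ("All USA", "OR"),
    "Rest": ("All USA",),
}

print(partitions_to_scan(("All USA", "OR"), slices))          # ['OR', 'Rest']
print(partitions_to_scan(("All USA", "CA", "Yolo"), slices))  # ['CA_Yolo', 'Rest']
```

In this model the "Rest" partition is a candidate for every query, because its slice sits at the top of the hierarchy. A uniform state-level plan would remove it from the candidate list for queries on California or Oregon.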
As the first half of this paper discusses partitions in the RDBMS, it is natural to ask whether Analysis Services partitions must be aligned with relational partitions. The two partition strategies do not need to be identical, but the partition management application is easier to design, build, and understand if the partitions are similar. A common strategy is to partition identically along date in both systems, with optionally a slice along a second or even third dimension in the cube.
The simplest strategy is to use the UNION ALL view as the source fact table for all cube partitions. If cube partitions are aligned with the relational partitions, each cube partition could point directly to its associated relational partition, circumventing the UNION ALL view. In this configuration the cube processing query that extracts data from the relational database will run fastest. The tradeoff for this performance improvement is that the maintenance application needs to ensure the source table is correctly associated with each partition.
If the relational database exists only to populate Analysis Services cubes and does not service any other queries, the system administrator may choose not to create and manage the UNION ALL view. Indexes on the relational tables would be designed to optimize the single query that loads data into the cube. In this case, the relational database is serving more as a staging area than a complete data warehouse.
Storage Modes and Aggregation Plans
Each partition can have its own storage and aggregation plan. Infrequently accessed data can be lightly aggregated, or stored as ROLAP or HOLAP rather than MOLAP, in both cases saving storage space. A cube that is incrementally loaded over time is unlikely to use this functionality along the time dimension of its partitions, because changing these parameters would require the partition to be reprocessed. In most situations, the added processing time and system complexity would hardly seem to justify the minimal space savings.
Partitions along other dimensions, by contrast, are likely to have different aggregation plans. The Usage-Based Optimization Wizard designs aggregations for each partition. The system administrator should focus the optimization wizard on the most recent partitions, and always base the aggregation design for each new set of partitions on the most current partitions, to keep the aggregation design as up-to-date as possible.
Managing the Partitioned Cube
The developer can use a variety of tools to build the management system for relational partitions. SQL-DMO is strongly recommended, but effective systems have been built using stored procedures, extended stored procedures, even Perl scripts that parse text files containing table definitions. The cube partition maintenance program, by contrast, must use DSO.
For system developers who come from a classic database background, the notion of using an object model to instantiate database objects may seem strange. The developer can use a familiar scripting language, such as Microsoft Visual Basic® Scripting Edition (VBScript), Microsoft JScript®, or Perl; or a development environment like Visual Basic (VB) or Microsoft Visual C++®, to develop the modules that use DMO and DSO. These modules can be scheduled from the operating system, from SQL Agent, or called from DTS packages. The requirement to use DSO to build the management system should not be viewed as a reason to forgo the use of partitions, even if the developer has never used an object model before. A VBScript sample that illustrates how to use scripting to clone partitions is provided later in this chapter.
If the relational data warehouse uses partitions, the cube partition management system should be designed as part of the relational database partition management system. The cube partition management system must perform the following functions:
Create new partitions as necessary, typically on a schedule related to the Dates dimension.
Load data into the partitions.
Drop old partitions (optional).
Merge partitions (optional).
Create New Partitions
At the same time the partition management system creates a new date partition in the relational database, it should create all the necessary cube partitions corresponding to that date. It is good practice to incrementally update the cube's dimensions before creating new partitions, as a new dimension member may be added along one of the partition slices.
The simplest case is when the cube is partitioned only by date. The partition management system simply creates one new partition on the appropriate schedule (day, week, month, etc.).
If the cube is partitioned by another dimension in addition to the date, the partition management system will be adding many partitions at a time. For example, consider a cube that is partitioned by month and by state within the U.S. Each month the system will create 50 new state partitions. In this case, it is safe to create this month's partitions by cloning last month's partitions, editing the necessary attributes such as slice and source table name, and updating the partition definition in the cube.
However, consider a cube that is partitioned by month and product brand. Product brands are much more volatile than states or provinces; it is reasonable to expect that a new brand would be added to the product hierarchy during the life of the cube. The maintenance application must ensure that a partition is created to hold this new brand's data. The recommended practice is to:
Process the dimensions before creating the new partitions.
Clone existing partitions to ensure continuity in storage modes and aggregation plans.
Search the processed dimension for new members, creating a partition for any new members of the partitioning level. The system would have to specify default storage mode and aggregation plan.
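The recommended practice above amounts to a small planning step that runs after the dimensions are processed. The Python sketch below shows only that planning logic; the naming scheme, the action labels, and the brand names are hypothetical, and the actual cloning would be done through DSO.

```python
def plan_partitions(month, brands, previous):
    """Plan this month's partitions for a cube partitioned by month and brand.

    month: label for the new month, for example "200012".
    brands: brand members found in the freshly processed dimension.
    previous: mapping of brand -> last month's partition name.
    Returns a list of (new_partition, action, source_partition) tuples.
    """
    plan = []
    for brand in sorted(brands):
        new_name = f"Sales_{month}_{brand}"
        if brand in previous:
            # Known brand: clone to keep storage mode and aggregations.
            plan.append((new_name, "clone", previous[brand]))
        else:
            # New member: nothing to clone, so defaults must be supplied.
            plan.append((new_name, "create_with_defaults", None))
    return plan


previous = {"Acme": "Sales_200011_Acme", "Best": "Sales_200011_Best"}
for step in plan_partitions("200012", {"Acme", "Best", "Crown"}, previous):
    print(step)
```

A brand that disappears from the dimension simply produces no new partition; whether its older partitions are kept is a separate pruning decision.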
The partition management system must be carefully designed to ensure that partition slice and filter definitions are aligned and remain accurate over time. If the relational database is partitioned, and those partitions are periodically merged as described earlier in this paper, the partition management system should update the cube partition definitions to synchronize with the source data. The cube partition need not be reprocessed, but the definition should be changed in case reprocessing becomes necessary in the future.
It is the job of the cube design and the partition management system to ensure that data are processed into one and only one partition. Analysis Services does not check that all rows from a fact table are instantiated in the cube, nor does it verify that a row is loaded into only one partition. If a fact row is inadvertently loaded into two partitions, Analysis Services will view them as different facts. Any aggregations will double-count that data, and queries will return incorrect results.
Load Data into the Partitions
Processing a partition is fundamentally the same as processing a cube. The natural unit of work for a processing task is one partition. The Analysis Manager processing wizard provides the following three modes for processing a cube or partition:
Incremental update adds new data to the existing cube or partition, and updates and adds aggregations affected by that new data.
Refresh Data drops all data and aggregations in the cube or partition, and rebuilds the data in the cube or partition.
Full Process completely rebuilds the structure of the cube or partition, and then refreshes the data and aggregations.
Incremental processing requires that the administrator define a filter condition on the source query, to identify the set of new data for the cube. Usually this filter is based on a date, either the event date or a processing date stored in the fact table.
The same functionality is available from the Analysis Services Processing task in DTS, which most systems use to schedule cube processing. Incrementally processed cubes use the Dynamic Properties task to change the source filter. The same functionality is also available through custom DSO code, although an incremental update requires a few more lines of code than a data refresh.
When designing the partition management system, it's important to note that incremental cube or partition processing requires that the partition have been processed in the past. Do not use incremental processing on an unprocessed cube or partition.
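A partition management system can encode this rule as a simple decision function. The Python sketch below is one possible policy, not a prescribed one: the function and parameter names are hypothetical, and the choice to fall back to a full process for an unprocessed partition is an assumption.

```python
def choose_processing_mode(already_processed, structure_changed, has_new_data_filter):
    """Pick a processing mode for one partition.

    already_processed: True if the partition has been processed before.
    structure_changed: True if the partition's structure was modified.
    has_new_data_filter: True if a filter identifying only the new rows
        is available for the source query.
    """
    if structure_changed or not already_processed:
        # Incremental update is never valid on an unprocessed partition.
        return "full process"
    if has_new_data_filter:
        return "incremental update"
    # Without a filter for the new rows, reload the partition's data.
    return "refresh data"


print(choose_processing_mode(False, False, True))  # full process
print(choose_processing_mode(True, False, True))   # incremental update
```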
A cube that is partitioned only along the Dates dimension has straightforward load management requirements. Typically there is a single partition to update for each load cycle; the only decision point is whether to incrementally update or refresh the data. Most cubes partitioned only by Date will be managed from a simple DTS package.
A cube that is partitioned along multiple dimensions has the following additional challenges and benefits:
Challenge: Large number of partitions to process
Challenge: Potentially changing number of partitions
Benefit: Parallel loading of partitions
Benefit: Greatly improved query performance on highly selective queries
Most applications that partition on multiple dimensions design the cube processing system to load partitions in parallel. A parallel loading system could launch multiple simultaneous DTS packages whose parameters have been updated with the Dynamic Properties task. While feasible, this structure is awkward, and many systems will choose instead to use native DSO code to update the partitions. A sample tool to process partitions in parallel is provided later in this chapter.
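The general shape of a parallel loader can be sketched in Python with a thread pool. This is not the sample tool referred to above: process_partition is a hypothetical stand-in for the real work (for example, a DSO call or a DTS package launch), and the partition names are invented.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_partition(name):
    """Stand-in for the real processing call; returns a status string."""
    return f"{name}: processed"


def process_in_parallel(partition_names, max_workers=4):
    """Process partitions concurrently, collecting results and failures."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_partition, n): n for n in partition_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure and keep going with other partitions.
                failures[name] = exc
    return results, failures


results, failures = process_in_parallel(
    ["Sales_200012_CA", "Sales_200012_OR", "Sales_200012_WA"], max_workers=2)
print(sorted(results), failures)
```

Collecting failures per partition, rather than stopping at the first error, lets the maintenance application retry only the partitions that did not load.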
Merge Partitions
A cube that is partitioned along Date will see the number of its partitions grow over time. As discussed above, there is theoretically a point at which query performance degrades as the number of partitions increases. Our testing, including the development of a cube with over 500 partitions, has not reached this limit. The system administrators will probably rebel before that point is reached, because the other disadvantage of many partitions, the slowness of meta data operations, will make it increasingly difficult to manage the database.
Analysis Services, both through DSO and the Analysis Manager, supports the ability to merge partitions. When two partitions are merged, the data from one partition is incorporated into a second partition. Both partitions must have identical storage modes and aggregation plans. Upon completion of the merge, the first partition is dropped and the second partition contains the combined data. The merge processing takes place only on the cube data; the data source is not accessed during the merge process. The process of merging two partitions is very efficient.
If the system design includes merged partitions, the merging process should occur programmatically rather than through Analysis Manager. Merging partitions is straightforward, and like other DSO operations requires few lines of code. The partition merging system must take the responsibility for verifying that the final merged partition contains accurate meta data information for the source filter, to ensure that the partition could be reprocessed if necessary. The partition merge process correctly changes the slice definition, and combines Filter definitions as well as it can. But the merge process does not require that both partitions be populated from the same table or data source, so it is possible to merge two partitions that cannot be repopulated.
A second issue to consider is that the merged partition, like all partitions, cannot be renamed.
These problems can be avoided by using the following good system design practices:
Use clear naming conventions.
Follow a consistent partition merging plan.
Take care to match up cube partitions with relational partitions, or do not partition the relational data warehouse.
For example, consider a Sales cube that partitions data by week. The current week is partitioned by day, and then merged at the end of the week. Our partitions are named Sales_yyyymmdd, where the date in the name is the first day of the data in the partition. In November 2000, we will have weekly partitions Sales_20001105, Sales_20001112, Sales_20001119, and Sales_20001126. During the next week, we create and process Sales_20001203, Sales_20001204, and so on through Sales_20001209. During the Sunday processing window, when there is little use of the system, we can merge 20001204 through 20001209 into Sales_20001203, leaving only the weekly partition. Alternatively, you could effectively rename a partition by creating a new empty partition with the desired name, and merging other partitions into it.
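The naming convention and merge plan in this example are easy to generate programmatically. The Python sketch below reproduces the Sales_yyyymmdd names for the week of December 3, 2000; the function names are hypothetical, and the merge itself would still be carried out through DSO.

```python
from datetime import date, timedelta


def partition_name(day, cube="Sales"):
    """Name a partition after the first day of data it holds."""
    return f"{cube}_{day:%Y%m%d}"


def weekly_merge_plan(week_start, cube="Sales"):
    """Return (target, sources): the six later daily partitions are
    merged into the partition named for the first day of the week."""
    target = partition_name(week_start, cube)
    sources = [partition_name(week_start + timedelta(days=offset), cube)
               for offset in range(1, 7)]
    return target, sources


target, sources = weekly_merge_plan(date(2000, 12, 3))
print(target)   # Sales_20001203
print(sources)  # Sales_20001204 through Sales_20001209
```

Because the target keeps the name of the first day in the week, the merged partition already follows the weekly naming convention and no rename is needed.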
Rolling Off Old Partitions
Deleting old data in a cube partitioned by Date is as simple as dropping the oldest (set of) partitions. Like the other operations we have discussed, this process should be managed programmatically rather than on an ad hoc basis through Analysis Manager.
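Selecting the partitions to drop is straightforward when partition names carry their start date. The Python sketch below assumes names of the form Sales_yyyymmdd and a rolling window expressed as a number of partitions to keep; both the function name and that window definition are assumptions.

```python
from datetime import datetime


def partitions_to_drop(partition_names, keep):
    """Return the oldest partitions that fall outside the rolling window.

    partition_names: names ending in _yyyymmdd (start date of the data).
    keep: number of most recent partitions to retain.
    """
    def start_date(name):
        return datetime.strptime(name.rsplit("_", 1)[1], "%Y%m%d").date()

    ordered = sorted(partition_names, key=start_date)
    if keep <= 0:
        return ordered
    return ordered[:-keep]


names = ["Sales_20001126", "Sales_20001105", "Sales_20001119", "Sales_20001112"]
print(partitions_to_drop(names, keep=3))  # ['Sales_20001105']
```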
Conclusions
Using local partitions is recommended for medium to large Analysis Services cubes, containing more than 100 million fact rows. Query performance of the Analysis Services database improves with partitioning. It is easier to maintain partitioned cubes, especially if old data are dropped from the cube. However, partitioning a cube requires an application to manage those partitions.
Partitioning in the relational data warehouse database is similar in concept to partitioning in Analysis Services. As with Analysis Services, an application must be built to manage relational partitions. Partitioning addresses some maintenance problems such as pruning old data, but at the cost of system complexity. Query performance is not improved compared to a well-indexed single table.
Both Analysis Services and the SQL Server relational database support distributed partitions, wherein partitions are located on different servers. We do not recommend distributing relational partitions for a SQL Server 2000 data warehouse system that supports direct queries.
Partitioned cubes exhibit improved query performance with large numbers of partitions. The developer of a large cube should consider partitioning along several dimensions, to maximize the selectivity of user queries and improve processing performance by providing the opportunity for parallel processing.
Partitions are strongly recommended for large Analysis Services systems. Partitioning the relational data warehouse is not generally recommended, although it can be an effective and well-performing solution to some specific warehouse maintenance issues.
For More Information
SQL Server Books Online contains more information about partitions. For additional information, see the following resources.
The Microsoft SQL Server Web site at http://www.microsoft.com/sql/.
The Microsoft SQL Server Developer Center at http://msdn2.microsoft.com/sqlserver/default.aspx.
SQL Server Magazine at http://www.sqlmag.com.
The microsoft.public.sqlserver.server and microsoft.public.sqlserver.datawarehouse newsgroups at news://news.microsoft.com.
The Microsoft Official Curriculum courses on SQL Server. For up-to-date course information, see http://www.microsoft.com/learning/default.asp.
VBScript Code Example for Cloning a Partition
Code Example 18.1
'/*********************************************************************
' File: ClonePart.vbs
'
' Desc: This sample script creates a new partition in the FoodMart 2000
'       Sales cube, based on the latest partition in the cube. The
'       purpose of the script is to show the kinds of DSO calls that
'       are used to clone a partition. The resulting partition is
'       processed, but adds no data to the cube.
'
'       Users of this script may want to delete the resulting partition
'       after running the script and exploring the results.
'
' Parameters: None
'*********************************************************************/
Call ClonePart

Sub ClonePart()
    On Error Resume Next

    Dim intDimCounter, intErrNumber
    Dim strOlapDB, strCube, strDB, strAnalysisServer, strPartitionNew
    Dim dsoServer, dsoDB, dsoCube, dsoPartition, dsoPartitionNew

    ' Initialize server, database, and cube name variables.
    strAnalysisServer = "LocalHost"
    strOlapDB = "FoodMart 2000"
    strCube = "Sales"

    ' VBScript does not support direct use of enumerated constants.
    ' However, constants can be defined to supplant enumerations.
    Const stateFailed = 2
    Const olapEditionUnlimited = 0

    ' Connect to the Analysis server.
    Set dsoServer = CreateObject("DSO.Server")
    dsoServer.Connect strAnalysisServer

    ' If connection failed, then end the script.
    If dsoServer.State = stateFailed Then
        MsgBox "Error-Not able to connect to '" & strAnalysisServer _
            & "' Analysis server.", ,"ClonePart.vbs"
        Err.Clear
        Exit Sub
    End if

    ' Certain partition management features are available only
    ' in the Enterprise Edition and Developer Edition releases
    ' of Analysis Services.
    If dsoServer.Edition <> olapEditionUnlimited Then
        MsgBox "Error-This feature requires Enterprise or " & _
            "Developer Edition of SQL Server to " & _
            "manage partitions.", , "ClonePart.vbs"
        Exit Sub
    End If

    ' Ensure that a valid data source exists in the database.
    Set dsoDB = dsoServer.mdStores(strOlapDB)
    If dsoDB.Datasources.Count = 0 Then
        MsgBox "Error-No data sources found in '" & _
            strOlapDB & "' database.", , "ClonePart.vbs"
        Err.Clear
        Exit Sub
    End If

    ' Find the cube.
    If (dsoDB.mdStores.Find(strCube)) = 0 then
        MsgBox "Error-Cube '" & strCube & "' is missing.", , _
            "ClonePart.vbs"
        Err.Clear
        Exit Sub
    End If

    ' Set the dsoCube variable to the desired cube.
    Set dsoCube = dsoDB.MDStores(strCube)

    ' Find the partition
    If dsoCube.mdStores.Count = 0 Then
        MsgBox "Error-No partitions exist for cube '" & strCube & _
            "'.", , "ClonePart.vbs"
        Err.Clear
        Exit Sub
    End If

    ' Set the dsoPartition variable to the desired partition.
    Set dsoPartition = dsoCube.MDStores(dsoCube.MDStores.Count)
    MsgBox "New partition will be based on existing partition: " _
        & chr(13) & chr(10) & _
        dsoDB.Name & "." & dsoCube.Name & "." & _
        dsoPartition.Name, , "ClonePart.vbs"

    ' Get the quoting characters from the datasource, as
    ' different databases use different quoting characters.
    Dim sLQuote, sRQuote
    sLQuote = dsoPartition.DataSources(1).OpenQuoteChar
    sRQuote = dsoPartition.DataSources(1).CloseQuoteChar

    '*********************************************************************
    ' Create the new partition based on the desired partition.
    '*********************************************************************

    ' Create a new, temporary partition.
    strPartitionNew = "NewPartition" & dsoCube.MDStores.Count
    Set dsoPartitionNew = dsoCube.MDStores.AddNew("~temp")

    ' Clone the properties from the desired partition to the
    ' new partition.
    dsoPartition.Clone dsoPartitionNew

    ' Change the partition name from "~temp" to the
    ' name intended for the new partition.
    dsoPartitionNew.Name = strPartitionNew
    dsoPartitionNew.AggregationPrefix = strPartitionNew & "_"

    ' Set the fact table for the new partition.
    dsoPartitionNew.SourceTable = _
        sLQuote & "sales_fact_dec_1998" & sRQuote

    ' Set the FromClause and JoinClause properties of the new
    ' partition.
    dsoPartitionNew.FromClause = Replace(dsoPartition.FromClause, _
        dsoPartition.SourceTable, dsoPartitionNew.SourceTable)
    dsoPartitionNew.JoinClause = Replace(dsoPartition.JoinClause, _
        dsoPartition.SourceTable, dsoPartitionNew.SourceTable)

    ' Change the definition of the data slice used by the new
    ' partition, by changing the SliceValue properties of the
    ' affected levels and dimensions to the desired values.
    dsoPartitionNew.Dimensions("Time").Levels("Year").SliceValue = "1998"
    dsoPartitionNew.Dimensions("Time").Levels("Quarter").SliceValue = "Q4"
    dsoPartitionNew.Dimensions("Time").Levels("Month").SliceValue = "12"

    ' Estimate the rowcount.
    dsoPartitionNew.EstimatedRows = 18325

    ' Add another filter. The SourceTableFilter provides an additional
    ' opportunity to add a WHERE clause to the SQL query that will
    ' populate this partition. We're using this filter to ensure our new
    ' partition contains zero rows. For the purposes of this sample code
    ' we don't want to change the data in the FoodMart cube. Comment out
    ' this line if you want to see data in the new partition.
    dsoPartitionNew.SourceTableFilter = dsoPartitionNew.SourceTable _
        & "." & sLQuote & "time_id" & sRQuote & "=100"

    ' Save the partition definition in the metadata repository
    dsoPartitionNew.Update

    ' Check the validity of the new partition structure.
    IF NOT dsoPartitionNew.IsValid Then
        MsgBox "Error-New partition structure is invalid."
        Err.Clear
        Exit Sub
    End If

    MsgBox "New partition " & strPartitionNew & " has been created and " _
        & "processed. To see the new partition in Analysis Manager, you " _
        & "may need to refresh the list of partitions in the Sales cube " _
        & "of FoodMart 2000. The new partition contains no data.", , _
        "ClonePart.vbs"

    ' The next statement, which is commented out, would process the partition.
    ' In a real partition management system, this would likely be a separate
    ' process, perhaps managed via DTS.
    ' dsoPartitionNew.Process

    ' Clean up.
    Set dsoPartition = Nothing
    Set dsoPartitionNew = Nothing
    Set dsoCube = Nothing
    Set dsoDB = Nothing
    dsoServer.CloseServer
    Set dsoServer = Nothing
End Sub