Chapter 13 - OLAP Services
Online analytical processing (OLAP) is an increasingly popular technology that can improve business analysis dramatically, but one that has been characterized historically by expensive tools, difficult implementation, and inflexible deployment. Microsoft has addressed the OLAP problem and created a solution that makes multidimensional analysis accessible to a broader audience and potentially at a significantly lower cost of ownership.
What is OLAP Services?
Microsoft SQL Server OLAP Services is a fully featured OLAP capability that is a component of Microsoft SQL Server version 7.0. OLAP Services includes a middle-tier server that allows users to perform sophisticated analyses on large volumes of data, with exceptional performance. A second feature of OLAP Services is a client cache and calculation engine called PivotTable Service, which helps improve performance and reduce network traffic. PivotTable Service allows users to conduct analyses while disconnected from the corporate network.
OLAP is a key component in the data warehousing process, and OLAP Services provides essential functionality for a wide array of applications ranging from corporate reporting to advanced decision support. The inclusion of OLAP functionality within SQL Server can help make multidimensional analysis more affordable and bring the benefits of OLAP to a wider audience. This includes not only smaller organizations, but also groups and individuals within larger corporations that have been excluded from the OLAP industry by the cost and complexity of today's products.
Coupled with a wide variety of tools and applications that support OLAP applications through the Microsoft OLE DB for OLAP interface, OLAP Services can increase access to sophisticated analytical tools and can reduce the costs of data warehousing.
Rationale for OLAP Systems
Historically, the bulk of investment in corporate computing has been in systems that generate or capture data, such as accounting, order processing, manufacturing, and customer information. Increasingly, organizations are investing in applications and technologies that deliver additional value from this collected data. Data warehousing is the process of collecting, cleansing, and sifting data from a variety of operational systems and making the resultant information available to a wide audience of end users for analysis and reporting. Data warehouse and data mart are terms used to describe these nonvolatile stores of cleansed and summarized information that are available for browsing by users.
Microsoft Data Warehousing Strategy
Several years ago, Microsoft undertook a pair of initiatives with the overall goal of expanding the availability of data warehousing and decision support capabilities in the business world. The two initiatives are the Microsoft Data Warehousing Framework, which is a roadmap for Microsoft product development, and the Microsoft Alliance for Data Warehousing, which is a coalition of industry businesses committed to the Microsoft platform and the Data Warehousing Framework for development and marketing purposes. The initiatives were based on a central strategy of Microsoft Corporation contributing to the data warehousing process by:
Lowering the costs of acquisition, implementation, and maintenance.
Redefining scalability to serve not only the large systems, but also to serve the individual user.
Increasing the integration tools provided by third-party vendors.
Microsoft Data Warehousing Framework
The Data Warehousing Framework is an open architecture that describes mechanisms for sharing data and metadata in the construction and management of data warehouses and data marts. The essential technologies underlying the Data Warehousing Framework are the OLE DB data interfaces and the instance of Microsoft Repository running on SQL Server.
Repository is a database that stores descriptive information about software components and their relationships (metadata). Metadata models have been defined in Repository for database schemas, data transformations, and OLAP database schemas.
The Data Warehousing Framework components represent integral steps in the data warehousing process, some of which are being delivered by Microsoft but that can be extended easily by Microsoft customers and third-party businesses using alternative technology.
SQL Server 7.0 provides many of the basic components required for building and maintaining a data warehouse: database design with a graphical schema designer; high-capacity data storage; data transformation capabilities through Data Transformation Services (DTS); OLAP capabilities through OLAP Services; and so on.
For more information about the Data Warehousing Framework, see "Data Warehousing Framework" earlier in this volume.
Although the data warehousing process prepares data for end user consumption, most information in a relational data warehouse is not easily browsed.
Data structures are often difficult for the end user to comprehend and questions such as "Who are the top sales people in each region for the last year by month?" are complex when expressed in SQL. Some of these challenges can be addressed with advanced query tools, which hide the database complexity from the end user, but for the larger class of applications in which the end user is viewing multidimensional data, Microsoft believes the optimal solution is OLAP technology.
All organizations, regardless of size, must manage complex multidimensional data. Even the smallest organization may need to track sales by product, salesperson, geography, customer, and time. Organizations have sought tools to access, navigate, and analyze multidimensional data in an easy, natural way.
OLAP is not a new concept, but the OLAP name has been given to this technology only recently. In 1993, Dr. E. F. Codd, the database researcher and inventor of the relational database model, coined the term in his paper, "Providing OLAP to User Analysis: An IT Mandate," wherein he laid out 12 rules that defined the characteristics of OLAP applications. Nigel Pendse and Richard Creeth of the OLAP Report (http://www.olapreport.com/fasmi.htm) later refined his definition with what is called the FASMI test. This test states that OLAP applications should deliver fast analysis of shared multidimensional information following these guidelines:
Delivers information to the user at a fairly constant rate. Most queries should be delivered to the user in five seconds or less.
Performs basic numeric and statistical analysis of the data, predefined by an application developer or defined ad hoc by the user.
Implements the security requirements necessary for sharing potentially confidential data across a large user population.
Is the essential characteristic of OLAP.
Accesses all the data and information necessary and relevant for the application, wherever it may reside and not limited by volume.
Value to Organizations
OLAP provides organizations with a means of accessing, viewing, and analyzing data with high flexibility and performance. First and foremost, OLAP presents data to end users through a natural and intuitive data model. Using this navigational style, end users can view and understand the information in their data warehouses more effectively, thereby allowing organizations to better recognize the value of their data.
Second, OLAP accelerates the delivery of information to end users viewing these multidimensional structures by preparing some computed values in the data in advance, rather than at execution time. The combination of easy navigation and fast performance allows end users to view and analyze data more quickly and efficiently than is possible with relational database technology only. The end result is more time spent analyzing data and less time analyzing databases.
OLAP Data Model
In an OLAP data model, information is viewed conceptually as cubes that consist of descriptive categories (dimensions) and quantitative values (measures). The multidimensional data model makes it easy for users to formulate complex queries, arrange data on a report, switch from summary to detail data, and filter or slice data into meaningful subsets. For example, typical dimensions in a cube containing sales information include time, geography, product, channel, organization, and scenario (budget or actual). Typical measures include dollar sales, unit sales, inventory, headcount, income, and expense.
Within each dimension of an OLAP data model, data can be organized into a hierarchy that represents levels of detail on the data. For example, within the time dimension, you may have the levels years, months, and days; similarly, within the geography dimension, you may have the levels country, region, state/province, and city. A particular instance of the OLAP data model would have the specific values for each level in the hierarchy. A user viewing OLAP data can move up or down between levels to view information that is either more or less detailed.
Aggregation and Storage Models
Cubes, dimensions, hierarchies, and measures are the essence of the multidimensional navigation of OLAP.
By describing and presenting data in this fashion, users can navigate a complex set of data intuitively. However, describing the data model intuitively does little to speed delivery of the information to the user.
A key tenet of OLAP is that users should see consistent response times for each view, or slice, of the data they request. Because data is usually collected at the detail level only, the information summary usually is computed in advance. These precomputed values, or aggregations, are the basis of the OLAP performance gains.
In the early days of OLAP technology, most vendors assumed that the only possible solution for OLAP applications was a specialized, nonrelational storage model. Later, other vendors discovered that through the use of database structures (star and snowflake schemas), indexing, and storage of aggregates, relational database management systems (RDBMS) could be used for OLAP. These vendors called their technology Relational OLAP (ROLAP). The earlier OLAP vendors then adopted the term multidimensional OLAP (MOLAP).
MOLAP implementations usually outperform ROLAP technology, but have problems with scalability. On the other hand, ROLAP implementations are more scalable and are often attractive to customers because they leverage investments in existing relational database technology.
A recent development has been a hybrid OLAP (HOLAP) solution, which combines the ROLAP and MOLAP architectures to yield a solution with the best features of both: superior performance and extensive scalability. One approach to HOLAP maintains detail records (the largest volumes) in the relational database, while maintaining aggregations in a separate, MOLAP store.
OLAP Services Architecture
Microsoft SQL Server OLAP Services has been designed from the ground up to help minimize the most significant ownership costs of building and maintaining OLAP applications. OLAP Services consists of both server and client (middle-tier) software components.
On the server side, OLAP Services OLAP server operates as a Microsoft Windows NT service and provides the core computational functionality. Programmatic access to administrative functions in the OLAP server is through an object model called Decision Support Objects (DSO), which is documented by Microsoft.
OLAP Manager, the built-in administrative user interface for OLAP Services, also is developed on DSO and provides a rich user experience without the need for programming. OLAP Manager, which can execute on a computer separate from the OLAP server, allows the database administrator to design OLAP data models, access information in RDBMS stores, design aggregations, and populate OLAP data stores, among other functions. The OLAP metadata definitions are stored in a private repository, but can be exported to Repository with the OLAP Open Information Model (OIM) by using a simple utility.
OLAP Services can access source data in any supported OLE DB data provider, which includes not only SQL Server but also a large number of desktop and server databases, including Microsoft Access, Microsoft FoxPro, Oracle, Sybase, and Informix. Any database source that provides an Open Database Connectivity (ODBC) interface is also accessible through a facility in OLE DB that wraps ODBC drivers and exposes them as if they were native OLE DB interfaces. These data sources can also reside on platforms other than the Windows NT operating system, for example, UNIX or mainframe systems and databases like IBM DB2 or Teradata. Through the multiplatform capabilities of OLE DB, data can be accessed from a wide variety of systems as if they were local to the OLAP Services server.
On the client side, OLAP Services includes a component called PivotTable Service. PivotTable Service is the facility that connects OLAP client applications to the OLAP Services server. All access to data managed by OLAP Services, by custom programs or client tools, is through the OLE DB for OLAP interface provided by PivotTable Service.
Both the client and server components of OLAP Services are extensible in functionality. Customer sites, independent software vendors (ISVs), and consultants all can extend the calculation, data management, or application functionality using the well-documented features in DSO. With this built-in extensibility, customers can be assured that OLAP Services will have the functionality necessary to address their application needs.
Challenges of OLAP Implementation
There are several challenges that must be overcome in the implementation of OLAP:
Building the OLAP data model
Performance and scalability
Building the OLAP Data Model
A fundamental challenge in OLAP implementation is mapping the initial database schema to the multidimensional model. This requires a significant programming effort with many of the products on the market today. In the evolution of OLAP products, OLAP database design has become a specialized and arcane process, intricately linked to the specific OLAP technology being deployed. Consequently, OLAP database developers are specialized, which has led to high costs in developing applications, concentrated at the data design stage.
In most OLAP implementations, it is assumed that the data has been prepared for analysis through data warehousing, whereby information has been extracted from operational systems, cleansed, validated, and summarized prior to incorporation into an OLAP application. This is a vital step in the process, which ensures that the data being viewed by the OLAP user is correct, consistent, and matches organizational definitions for the data.
Increasingly, information in a data warehouse is organized in star (or snowflake) schemas, which simplify user understanding of the data, maximize performance for decision support applications, and require less storage for large databases.
The following illustration is an example of a star schema. In this database schema, a central fact table is linked to related dimension tables.
A star (snowflake) schema is a relational approximation of the OLAP data model and can be an excellent starting point for building OLAP cube definitions. Few OLAP products, however, have taken advantage of this trend. Generally, they have not provided easy tools to map a star schema to an OLAP model, and as a result keep the cost of building OLAP models extremely high and the development time unnecessarily long.
Intuitive User Interfaces
One of the key differences in Microsoft SQL Server OLAP Services version 7.0 is the OLAP Manager user interface, which has been created with the infrequent OLAP database administrator in mind. The OLAP Manager is delivered as a snap-in to the Microsoft Management Console (MMC), and it shares the same administrative user interface as the entire Microsoft BackOffice family of products. The obvious benefit is that the OLAP database administrator is better able to translate skills from SQL Server and other Microsoft products.
More value becomes apparent when the power and flexibility of MMC are understood. OLAP Services includes a full range of taskpads that guide the novice or infrequent user through common tasks. OLAP Services also includes a full tutorial on OLAP concepts and a step-by-step guide to building an OLAP cube. A full complement of wizards is available for automating the most common activities, such as creating a dimension definition.
Furthermore, OLAP Services is optimized for developing data warehouses in which star or snowflake schemas have been designed. The Cube Wizard is especially suited to these prebuilt schemas, and translation to the multidimensional model is extremely rapid. OLAP Services can easily accommodate other source schemas should they be encountered.
To ensure successful interpretation of the OLAP Services user interface concepts, Microsoft conducted usability tests. Finally, large-scale beta testing has provided broad exposure and customer input to OLAP Services. As a result of the energy spent on the database administrator requirements, most users are able to build their first cube in less than 30 minutes.
Managing Data Explosion with Aggregation
As noted earlier, precomputing aggregations is a key performance strategy for most OLAP products. However, preaggregation comes at a significant cost: The number of aggregations can easily outstrip the number of initial detail points, and the volume of data stored expands dramatically. The number of aggregations in an OLAP model is a function of the number of dimensions, the number of levels in the hierarchies, and the parent-child ratios. For more information, see the OLAP Report at http://www.olapreport.com.
Real-world examples of the effects of this data explosion abound. A recently published standard benchmark test of another OLAP product resulted in a data explosion factor of 240, requiring 2.4 gigabytes (GB) of storage to manage 10 megabytes (MB) of input data. Providing adequate disk storage to handle the huge expansion in data is a significant cost of large-scale OLAP implementations, and it creates distinct limits on the ability of an organization to analyze all the desired source-level data.
Because of data explosion, OLAP applications can suffer even more when the source or detail data is distributed sparsely throughout the multidimensional cube. Missing or invalid data values create sparsity in the OLAP data model. In the worst case, an OLAP product would nonetheless save an empty value. For example, a company may not sell all products in all regions, so no values would appear at the intersection where products are not sold in a particular region.
Reducing data sparsity is a challenge that has been met with varying degrees of success by different OLAP vendors. The worst implementation results in databases that stored empty values, thus having low density and wasting space and resources. OLAP Services does not store empty values, and as a result, even sparsely populated cubes do not balloon in size. While this issue is frequently highlighted by some OLAP vendors as a deciding factor in OLAP architectures, the differences between vendor implementations of sparsity management is minor compared to the more significant data explosion caused by precalculating too many aggregates.
Flexible Storage Choices
Microsoft believes OLAP Services leads the market in offering a flexible solution that allows the OLAP database administrator to decide which storage model is most appropriate. OLAP Services supports a full MOLAP implementation, full ROLAP implementation, or a HOLAP solution. For example, a database administrator may opt to put frequently accessed data in MOLAP and historical data, which has more scalability problems, in ROLAP. However, the underlying data model is completely invisible to the client application, and its user perceives only cubes.
Whether one chooses to implement a MOLAP, ROLAP, or HOLAP data model, the integration of OLAP Services with relational databases is superior. OLAP Services maintains strong links between the source data, the OLAP multidimensional metadata, and the aggregations themselves by tying the graphical user interface design tools and wizards directly to OLE DB.
When implementing ROLAP data models, OLAP Services defines, populates, and maintains all of the relational database structures. This frees the developer from having to perform these tasks, or worse, from having to manage complex queries across multiple tables and servers.
Also, OLAP Services has minimized a fundamental problem of OLAP technology: data explosion caused by excessive preaggregation. As described earlier, OLAP data explosion is the result of multidimensional preaggregation. In traditional OLAP systems, data that has not been preaggregated is not available for reporting and analysis purposes unless calculated at run time. By precalculating and storing all possible combinations of aggregates (for example, the sum of all products and product levels across all time periods, across all organizations, across all distribution channels, and so on), traditional OLAP products create a massive data explosion.
In contrast to the approach of calculating all possible aggregations, OLAP Services determines which aggregations provide the greatest performance improvements, but also allows the OLAP database administrator to make a trade-off between system speed and the disk space required to manage aggregations through the Storage Design Wizard. (See the following illustration.) If the developer were to precalculate all aggregations, disk space requirements would be maximized (hence the data explosion syndrome). On the other hand, if the developer were to precalculate nothing, disk requirements would be zero, but performance would not be improved.
In most cases, OLAP Services can gain about 80 percent improvement in query performance without excessive precomputation of aggregations (the exponential explosion of data typically occurs during the last 20 percent of aggregations). OLAP Services analyzes the OLAP metadata model and heuristically determines the optimum set of aggregations from which all other aggregations can be derived. The result is that OLAP Services derives nonaggregated data from a few existing aggregate values rather than having to scan the entire data warehouse. This strategy of partial preaggregation, however, is only the starting point.
Although the OLAP Services heuristics are excellent, they are based on mathematical models that may or may not correspond to actual usage patterns. To optimize performance according to actual usage patterns, OLAP Services optionally logs queries sent to the server. These logs then can be used to fine-tune the set of aggregations that OLAP Services maintains. For example, the Usage-Based Optimization Wizard allows the database administrator to tell OLAP Services to create a new set of aggregations for all queries that take longer than n seconds to answer (where n might be 10 seconds or more).
Disk space can be purchased whereas time cannot. If the extraction time of critical information outweighs the cost of purchasing more disk space, the solution is apparent.
The OLAP Services solution for data explosion reduces the time required to process initial loads and incremental updates, as well as minimizes the amount of disk space necessary. If an application begins with a 10-GB data warehouse and generates 10 GB of aggregations, then the processing time required is a fraction of that required to process the fully exploded set of aggregations.
OLAP Services also has taken an innovative approach to data sparsity. Although the details of the internal implementation are proprietary, the net result is that both MOLAP and ROLAP implementations manage storage requirements extremely well, often resulting in databases with smaller OLAP storage needs than the original detail data.
Virtual cubes can be used whenever the user wants to view joined information from two dissimilar cubes that share as few as one common dimension. Similar in concept to a relational view, virtual cubes are two or more cubes linked at query time along one or more common dimensions. One benefit of virtual cubes applies to situations in which data sparsity is a significant problem. For example, a cube that contains measures for sales by unit and selling price could also have a measure for list price to compute discounts, but the list price value would be repeated many times. By building a list price cube that is joined in a virtual cube with the sales by unit and selling price information, the database administrator can eliminate much of the data redundancy. The ability to create virtual cubes means that many unnecessary values can be eliminated from the OLAP data storage altogether.
Performance and Scalability
The specific performance metrics of an OLAP application are a function of several factors, including database size, hardware computing power, and disk space allocated to preaggregated data. However, in real-world implementations, OLAP Services–based applications respond to most queries in less than 5 seconds and to nearly all queries within 10 seconds.
The OLAP Services implementation of partitioned cubes makes the technology highly scalable. A partitioned cube allows one logical cube of data to be spread over multiple physical cubes and even separate physical servers. In response to a user query, OLAP Services distributes the query among the partitioned servers, allowing the data to be retrieved in parallel.
For example, consider the case of an application tracking telephone calls for 10 geographic regions, where one would expect several million calls per day. For the purpose of analysis, one could partition the data among 10 servers, each containing the data for a particular region. From the user perspective, however, there is simply one logical cube of data. In response to user requests for this information, OLAP Services seamlessly transforms the queries as appropriate for each of the 10 servers and returns a single result set to the user. Each of the 10 databases is also available for separate access by analysts seeking information for that particular region.
Delivering OLAP Information to the User
Historically, because OLAP server technology has been tightly linked to proprietary client technology, customers have had little choice in their selection of mixed, best-of-breed products. This has led to high implementation costs and often inadequate choices for applications that require both client/server and Web-based OLAP information delivery. As was recognized years ago in the relational database industry, a common interface is necessary to promote openness in the selection of applications and databases. The industry standard became ODBC.
The issue of openness in OLAP tools was first taken up in 1996, when a vendor consortium called the OLAP Council announced an interoperability standard, multidimensional application programming interface (MDAPI), which was meant to open the marketplace to greater vendor participation. In spite of much customer anticipation, the vendor community, including the members of the OLAP Council, generally shunned MDAPI.
Recognizing the need for a unifying standard that leveraged existing customer investments, Microsoft extended the definition of the existing OLE DB data access API to include multidimensional capabilities. In one year, Microsoft published two drafts of the API, sought feedback from vendors and the public, and ultimately delivered a final version in February 1998 that was endorsed by 18 vendors at beta release.
Disconnected and Web-based Delivery
Individuals often need to analyze data multidimensionally while they are disconnected from the corporate network, such as when traveling with a laptop computer. Disconnected users typically want to view and analyze small slices of their entire cube; for example, a sales manager may want to view a revenue summary for a particular region while visiting a regional office. The need is so common that desktop OLAP (DOLAP), which does not require a shared server for multidimensional data access, was created.
Most OLAP server technologies do not provide transparent creation of DOLAP cubes. As a result, this step has been left as yet another development-intensive effort or relegated to OLAP client tools that have added functionality to support desktop usage. Overall, this has increased the cost and complexity of delivering applications that require both connected and disconnected clients.
A popular viewing tool for any type of information, especially multidimensional information, is a Web browser. A key means of reducing the cost per user in large-scale OLAP applications, Web browsers open the world of multidimensional access to a wider audience. At present, there are some good products and tools for delivering OLAP data over an intranet, but there is no easy mechanism for an application developer to create custom OLAP viewing tools.
The OLAP Services server caches user queries and metadata as well as data. Cached query definitions and metadata make it possible for OLAP Services to answer new queries by calculating previously cached data rather than accessing the disk. For example, one user asks for sales data for January, February, and March. Another user asks for sales data for the first quarter. OLAP Services can summarize January through March from random access memory (RAM) faster than it can fetch the first-quarter data from disk.
OLAP Services is unique in that it provides much of the same functionality on the client. Every client connects to OLAP Services servers by using PivotTable Service. PivotTable Service acts as a driver to manage the connection between the client and server. PivotTable Service shares much of the same code as the OLAP Services server, bringing the server's multidimensional calculation engine, caching features, and query management directly to the client. The result is an innovative client/server, data-management model that optimizes performance and minimizes network traffic. This comes at a very small computing cost: The disk space required for PivotTable Service is approximately 2 MB, and the memory requirements are only 500 KB in addition to the cached data.
The OLAP Services intelligent client/server architecture is capable of determining how to answer a user request as quickly as possible, while eliminating redundant network traffic. The key to this architecture is shared metadata between client and server. When a user requests information from the server, both data and metadata (definitions of the cube structure) are downloaded to the client. Having the cube metadata present on the client allows PivotTable Service to resolve which requests need to be returned to the server for resolution.
For example, recall the three-month sales data scenario. Assume that both the OLAP Services server and the client application have just been started. When the user asks for sales data for January, February, and March, the data will be cached on both the server and the client. If the user then asks for data for the first quarter, PivotTable Service will derive the results locally (at the client) without sending the query to the server. If the user subsequently asks to see data from the first quarter of this year compared to that of last year, PivotTable Service has the intelligence to access the server for last year's data only.
PivotTable Service also provides the mechanism for disconnected usage. Portions of cubes defined and accessed from a server can be saved on the client for later access when disconnected from the network. In this way, business users can take portions of their database while travelling and still have complete analysis capabilities. In addition, PivotTable Service allows users to create simple OLAP models locally, accessing information in OLE DB compatible data sources, from flat files to desktop databases.
Finally, PivotTable Service provides the connectivity for Web-based applications. While OLE DB for OLAP is a low-level programming interface, a new extension to ActiveX Data Objects (ADO) provides multidimensional data access. This extension, called ADO/MD, can be used to create ActiveX controls in the Microsoft Visual Basic development system to browse, chart, or report on data in OLAP Services from a Web page. ADO/MD is the corporate application programmer's tool for gaining access to the full functionality of OLAP Services.
Affording OLAP Tools
Typically, OLAP products can cost $50,000 to $100,000 for less than 50 users. Microsoft recognizes that OLAP is a natural extension of database technology and has included OLAP Services as a feature of SQL Server 7.0.
SQL Server 7.0 includes other complementary data warehousing features:
Visual Database Tools, for creating database schemas
Data Transformation Services (DTS), for extracting and transforming operational data into the data warehouse
Microsoft Repository, for providing a unified metadata store built on SQL Server