Introduction to the Unified Dimensional Model (UDM)
Updated June 2005
Microsoft SQL Server 2005
Unified Dimensional Model (UDM)
Summary: This paper provides an introduction to the Unified Dimensional Model (UDM), including the basic end-user model, and a brief overview of the architecture and security models. (17 printed pages)
An end-user wishing to retrieve information directly from a data source, such as an Enterprise Resource Planning (ERP) database, faces a number of significant challenges:
- The contents of such data sources are often very hard to understand, being designed with systems and developers rather than end users in mind.
- Information of interest to the user is commonly distributed among multiple heterogeneous data sources. Even if dealing only with different relational databases, the user needs to understand the particulars of each (the dialect of SQL used, for example). Worse, those data sources might be of quite different types, including not only relational databases but files and web services.
- While many data sources are orientated towards holding large quantities of transaction level detail, many times the queries necessary to support business decision making involve summary, aggregated information. With increased data volumes, the time required to retrieve such summary values for interactive end-user analysis is often prohibitive.
- Business rules are generally not encapsulated in the data sources. Users are left to make their own interpretation of the data.
The role of a Unified Dimensional Model (UDM) is to provide a bridge between the user and the data sources. A UDM is constructed over one or more physical data sources, and then the end user issues queries against the UDM using one of a variety of client tools, such as Microsoft Excel.
Figure 1. The UDM provides a bridge between end users and their data
At a minimum, when the UDM is constructed merely as a thin layer over the data source, the advantages to the end user are a simpler, more readily understood model of the data, isolation from heterogeneous backend data sources, and improved performance for summary type queries. In some scenarios a simple UDM like this is constructed totally automatically. With greater investment in the construction of the UDM, additional benefits accrue from the richness of metadata that the model can provide.
This paper provides an introduction to the UDM. First, the basic end-user model provided by a UDM is described by means of an example scenario. Then aspects of such a scenario are examined in more detail to show how the UDM:
- Allows the user model to be greatly enriched.
- Provides high performance queries supporting interactive analysis, even over huge data volumes.
- Allows business rules to be captured in the model to support richer analysis.
- Supports 'closing the loop,' where the user acts upon the data they see.
Finally a brief overview of the architecture and security model is provided.
Consider an example where an end-user wishes to compare sales with quotas for different time periods.
The sales data is stored in the main Sales and Inventory database, of course containing many other tables. Even after identifying the relevant tables, we find that data for a single entity, such as Product, is smeared over numerous tables, and as referential integrity is enforced by the application logic, no relationships are defined between those tables. The Sales Quotas are stored in the database of another application. Neither database captures any business rules, such as the fact that in comparing quotas with actual sales, it is the date that the order shipped that must be used rather than the numerous other dates for orders (date ordered, date due, data scheduled, and so forth).
Figure 2. The data sources containing Sales and Sales Quotas
First, consider the case where the end user accesses the data sources directly. An example of a query being constructed using a sample tool is shown in the following illustration.
Figure 3. Querying the data sources directly (Click the image to see a larger version)
By this point, the user has made considerable progress. They have:
- Sifted through the large number of cryptically named tables to find the ones of interest.
- Identified which columns they think should be used to join the tables together.
- Selected those columns that contain the details of interest, often having to pull them out from a sea of system-orientated details. For example, among the 11 columns on the tables holding details of product categories, only the two name columns are actually relevant to a human.
The user is now engaged in defining where 'outer' versus 'inner' joins should be used, and how to group details to provide the required aggregates.
However, the worse is still to come. How to get data joined in from the other data source? Even if one of the databases supported distributed queries, it is well beyond most users to construct the required query, and tools generally fall shy of supporting the user in this.
SELECT Quotas.QuotaAmount, Quotas.EmployeeId, ... FROM OPENROWSET('SQLOLEDB','seattle1'; 'Sales';'MyPass', 'SELECT * FROM Forecasts.dbo.SalesQuota' ) As Quotas
Then considering other data sources, such as web services, the user is presented with another huge hurdle in determining how to make the necessary remote calls, and process the returned XML to join it in with the other data.
The final blow is that once this work is done for one query ('show total sales and quotas by category'), much of it must be redone for the next query ('..now break it down by employee'), and every succeeding query.
Accessing via a UDM
By contrast, an example of how building a query would appear to an end-user accessing a simple UDM constructed over these data sources is shown below. The UI shown in this example is drawn from the development tools provided with Microsoft SQL Server. It could equally well have shown other clients tools, such as Office Excel or Office Web Components (OWC), or any of the numerous other reporting and analysis tools that support the UDM.
Figure 4. Using a simple UDM over the data sources
The tree view on the left presents the contents of the UDM. Firstly, we note that:
- Only user-orientated, relevant items are exposed to the user. The 'system columns' like rowguid, or date last modified, are not visible.
- The names used are friendly names, not those resulting from the developer orientated naming conventions employed in the underlying database.
The UDM also groups all attributes for each business entity (such as Product, or Employee) into separate 'dimensions.' The client could therefore refer to the Product Color, Subcategory, and Category in this example without explicitly performing joins between the numerous tables involved.
Those columns that represent transaction values, or measurements, which a user is commonly interested in aggregating (such as 'Sales Amount' or 'Quota), are then presented as 'measures.' This method of presenting the data, as 'measures' and 'dimensions,' is what is known as Dimensional Modeling, and has been shown in the industry to be a successful model for end user understandability.
The right hand side of the diagram shows those elements included in the current query. In this case, to request 'Sales Amount and Quota by Product Category,' the user defined the query simply by dragging the three relevant items from the tree view. There was no need for the user to specify any of the details necessary to actually access the two different data sources and perform the necessary joins between the many tables involved.
The model certainly defines the simple formatting to be used by default (for example, the use of currency symbols). Richer formatting can also be defined, including conditional formatting (such as displaying a value in red if below a certain threshold).
The same model supports a wide variety of queries. For example, figures could be broken down by employee simply by dragging in an attribute from the employee dimension.
While the preceding example demonstrates how even a simple UDM can greatly simplify the basic exploration of data, there are additional challenges presented when providing end users access to data. For example:
- A UDM supporting many different types of queries from different users might itself grow to considerable size. How can we ensure a user with a particular focus is not overwhelmed with irrelevant information?
- How do we support the desire of global users to see reports in their native languages?
- How do we make it easy to ask all the common questions about time (for example, 'show sales compared with the same period last year').
This section examines some of these questions to show how the UDM supports more advanced data exploration.
While the consolidation of all the attributes of an entity into a dimension greatly simplifies the model for the user, there are additional relationships between the attributes that a simple list cannot express. In the preceding case, Category, SubCategory and SKU define one of the hierarchies in which products can be organized. As users often want to perform analysis based on such hierarchies, for example, first seeing totals by Category, then drilling down to SubCategory and subsequently to the lowest SKU level, the UDM allows the definition of such hierarchies. Each hierarchy is simply a sequence of attributes that can then be used in queries to ease such drill-down/drill-up scenarios.
An example of how hierarchies might appear in an end-user UI is shown below. Here the model contains several different hierarchies by which products are organized. The query ('show sales and quotas by product category, then broken down into subcategory') was defined by simply dragging the 'Products By Category' hierarchy, then double-clicking to expand the 'Bike' category to see the more detailed data.
Figure 5. A UDM including hierarchies
The UDM handles the details of how to navigate between levels of a hierarchy, as well as, in this case, the fact that Quotas are not available at the SubCategory level, but only per Category.
One special kind of hierarchy is a parent-child hierarchy, covering entities that have an involuted relationship to themselves. In the next illustration, the Employee dimension has a hierarchy 'Employees By Organization Structure.' Use of this hierarchy allows easy navigation of the parent-child relationship and analysis of the rolled up values. The quota for the VP of Sales, Charles Marshall, includes the sum of the quotas of all of his staff, plus any quotas associated directly with him.
Figure 6. A parent-child hierarchy
Users naturally apply categorizations to their data—for example, 'these attributes are all about an employees personal details, that attribute is an email address.' The UDM provides two mechanisms specifically aimed at allowing the end user tools to provide additional value based upon such categorizations, and upon the meaning of the data being presented:
- Dimensions, attributes and other objects can be placed into semantically meaningful categories, allowing more intelligent usage within a client tool. For example, an attribute can be marked as being a URL, whereupon the report could enable navigation based on the values of the URL. Another attribute might be marked as being an email address, allowing a reporting client to automatically open a new email upon some user action.
- Measures, hierarchies, and other objects can be grouped into folders meaningful to the user, allowing the reporting tool to display large numbers of attributes in a manageable way. For example, there might be a group of attributes 'Customer\Demographics.'
Time information is generally recorded in the underlying data source simply using DataTime or Date data types. While those users proficient in SQL (or XPath in the case of web services) can extract the parts necessary to see, for example, data totaled by year, it is very hard to ask questions based upon other aspects of time, such as 'Show sales by day of week' or 'Break down by fiscal year, starting on July 1st.'
The UDM, however, has a built in knowledge of time, including various different calendars:
- Reporting ('445' etc.);
- Manufacturing (13 periods);
Hence, the model can include a time dimension that provides a rich set of attributes defining details of each day. In the next illustration, the user elected to see the sales amount and quotas for fiscal year 2001, simply by dragging the relevant item from the tree onto the filter area. The UDM knows both how to translate that into a range of dates, and the business rule that it orders shipped on those dates, not those due or ordered, that must be included in the query. Of course, if the DueDate was also relevant, this could also be included in the model. The correct join is therefore implicitly made by the UDM.
Figure 7. A UDM including Time
Furthermore, the UDM provides specific support for answering common time related questions (such as period-to-period comparisons—'compare this month with the same month last year').
In the preceding examples, both the model contents and the data are displayed in English. If this were the only option, it would obviously be an issue for organizations with international users.
To address this, the UDM allows translations of metadata to be provided in any language. A client connecting using a particular locale would then see all metadata in the appropriate language.
Furthermore, the model can also provide translations of data. An attribute can map to different elements in the data source, providing the translations in different languages. For example, if the same tool we have been using for the preceding examples connects from a French client, it would display the UDM and query results in French, as shown below.
Figure 8. The UDM as seen by a French user
While the example model used here is of very modest size, real-world models might be defined that have a much wider scope, maybe including tens of measures and dimensions, with each dimension including tens or hundreds of attributes. Generally, however, users engaged on a particular task will not need to see the entire model. What is required is the ability to define cut down views of the entire model, to avoid users being overwhelmed with the sheer size of the model.
The UDM provides these views with perspectives. A UDM can have numerous perspectives, each one presenting only a specific subset of the model (measures, dimensions, attributes, and so forth) that is relevant to a particular group of users. Each perspective can then be associated with the user security roles that define the users that should see that perspective.
For example, a 'Seattle Inventory' perspective could be defined that includes only measures from the inventory measure group, that hides the 'Warehouse By Location' hierarchy, and makes the default City be 'Seattle.'
There is additional semantics, aimed at making the information more readily consumable, that a UDM provides for each attribute. Examples are:
- Names vs. Keys: Looking at the relational database, it might not be immediately obvious that EmployeeID is a meaningless, unique, system generated key. The UDM, however, allows the Employee attribute to have both a key (the unique EmployeeID), and a name (in this case, a concatenation of FirstName and LastName). Now a query such as 'show the employees' will correctly distinguish employees with the same name, but will display the meaningful name to the user.
- Ordering: It is sometimes the case that the values of attributes must be displayed in some fixed order, different from a simple alphabetic or numeric order. For example:
- The days of the week appear as 'Sunday,' 'Monday,' 'Tuesday,' and so on.
- Priorities are displayed in the order 'High,' 'Medium,' 'Low.'
The UDM allows the definition of a default ordering that accounts for this.
- Discretization: For numeric attributes, it is often the case that showing each distinct value of the attribute is not useful. For example, when requesting 'Sales by Product Price,' seeing all of the hundreds of different prices ( $9.97, $10.05, $10.10,...) is much less useful than seeing the sales per price range ( <$10, $10 - $15,...). The UDM allows attributes to be discretized into such ranges using various criteria.
Key Performance Indicators (KPIs)
Businesses often define Key Performance Indicators (KPIs), which are important metrics used to measure the health of the business. The UDM allows such KPIs to be defined, enabling a much more understandable grouping and presentation of data.
Each KPI in the UDM defines up to four expressions for some performance metric (Sales level, for example):
- The actual value.
- The goal value.
- The status. A normalized value between -1 and 1 that provides the status of actual vs. goal (-1 is 'very bad,' 1 is 'very good').
- The trend. A normalized value between -1 and 1 that provides the trend over time (-1 is 'getting a lot worse,' 1 is 'getting a lot better').
In addition, a KPI can define a recommended graphic, such as a traffic light to indicate 'good, average, bad,' when displaying the status and trend.
The use of KPIs allows client tools to present related measures in a way that is much more readily understood by the user. The figure below shows an example of how three KPIs, organized into display folders, might be displayed by a client tool.
Figure 9. KPIs
Interactive exploration by end-users requires fast response times. This presents challenges given the very large data sets over which such exploration is often conducted.
To improve performance, the UDM provides caching services, potentially caching both the detailed data read from the underlying data source, and pre-calculated aggregate values based upon that data. However, use of such cached values might imply some degree of staleness of the data. Business circumstances dictate how current information must be. It may be critical that some data is always shown with the latest figures, while for other data, it might be perfectly acceptable to show figures that are two hours, or two days, old.
To reflect this, the UDM allows the cache to be either explicitly managed (for example, a schedule could be defined to refresh the cache daily at 2 a.m.) or transparently managed using 'proactive caching.' The user can specify how up-to-date the data needs to be (including being totally up-to-date), and the UDM will provide automatic cache creation and management to provide the fastest possible query response while reflecting the policies controlling the currency of the data.
The previous sections addressed how the UDM can support interactive exploration of data. However, simply making information available from the underlying data sources, even if in a much more easily understood and usable form, is clearly not sufficient if we are to include significant business logic into the user's model.
Hence the UDM provides the ability to define both simple and complex calculations over the data.
At the simple end of the scale, queries generally return aggregated data ('show sales by category,' not 'show each and every sales order line'). How should various measures aggregate? For example, there is nothing in the underlying relational data saying that while 'Sales Amount' can sensibly be summed, 'Unit Price' should be averaged. The UDM adds this semantic. The method of aggregation can be defined using a variety of schemes:
- A simple aggregation function of Sum, Count, Distinct Count, Max or Min can be used.
- The aggregation can be defined as being semi-additive, using a simple function such as 'Sum' for all dimensions except time, where 'Last period' is used. For example, while the Inventory Level can be summed from Product to Product Category, the inventory level for the Month is not the sum of the inventory levels for each day, but rather the inventory level of the last day in the month.
- The aggregation can be based on the type of account, such as Income vs. Expense.
- The aggregation can be customized to fulfill any special requirements.
A UDM can also contain calculated members. These members have no direct association with the source data but are instead derived from that data. For example, a calculated member, Variance, can be defined to calculate the difference between Sales and Quota.
Similarly, a UDM can define sets of entities of interest to the user, for example the top 10 customers (by volume of sales), or the most important products. These sets can then easily be used to restrict the scope of a query to a particular set of entities.
Even considering the simple example being presented here, it is readily apparent that calculations required by users go well beyond the trivial 'Variance' example. For example:
- 'Show the moving 3 month average for each time period'
- 'What's the year on year growth compared to the same period last year?'
- 'Sales are reported in the base currency. Convert them back to the original currency, using the daily average exchange rate at the time of sale.'
- 'Calculate the budgeted sales per category for next year as 10% more than this year, then allocate down to each product based on relative average sales over the last 3 years.'
The UDM provides a rich model for defining such calculations, providing something akin to a multi-dimensional spreadsheet, where the value of a cell, for example, AverageSales for the Category 'Bike' in the Year '2003,' can be calculated based upon the values in other cells. However, even this metaphor fails to adequately describe calculations in the UDM. This is, in part, because a cell might have its value calculated not just on what the value of another cell is, but also what the value there used to be. Hence, simultaneous equations can be supported; for example, profit is derived from revenue minus expense, but bonuses (included in the expenses) are derived from profit.
In addition to providing the powerful language MDX (Multi Dimensional Expressions), specifically designed for authoring such calculations, the UDM also enables integration with .Net. This integration allows stored procedures and functions to be written in any verifiable .NET language (such as C#.NET or Visual Basic.NET), and then invoked from MDX for use in calculations.
The client, of course, is isolated from the details of such calculations. To client applications, the model now simply has more useful measures. In the example below, for instance, the user is viewing various calculated measures, based on Sales, for the most profitable products sold in the USA.
Figure 10. A UDM including analytics (Click the image to see a larger version)
Integration with Data Mining
Showing data in a rich, readily understood form is very valuable, but what about inferring new information from that data?
The UDM is tightly integrated with Data Mining technology, to allow data to be mined and later to use the discovered patterns for prediction.
For a user, seeing data often leads immediately to further questions, or to the desire to take some action. For example:
- 'What are the detailed sales going into that number?'
- 'The quota is too low—I need to increase it.'
- 'That looks odd—I want to mark that number with a comment.'
- 'What details for that promotion do we have on our web site?'
Hence it is not enough to merely present data to users in an easily understood manner. It is also necessary to make it easier for them to take action based upon the data they see.
The UDM supports this in two ways:
- by allowing changes to be made to the data;
- by allowing actions to be associated with the data.
The UDM is not read-only. The data can also be updated through the UDM. In the case of measures, the updates can actually then be held separately from the original values, as deltas to those values.
In addition, it is possible to update summary numbers. For example, consider a Budgeting scenario. While the budgeted amount might eventually be known down to a detailed level (by team and account, for example), values might first be known only at a more summarized level (by department and account type).
The UDM supports actions as a link between the data and an action taken based on the data. The main kinds of actions are:
- URL: Go to a specified URL. This supports both navigating to some URL to obtain further information, and navigating to some web based application allowing some task to be performed. For example:
- for a product, go to the company website describing that product;
- for a product/warehouse combination, go to the web-based inventory management application, passing the product/warehouse as parameters, to allow the safety stock level to be increased.
- Reporting: Execute a specified report. For example, for a given product, the action could execute a parameterized product report describing the product and the current order status.
- DrillThrough: Drill through to the lowest level of detail available. For example, a user examining total sales by product and customer could drill through to view all sales transactions contributing to the total.
The actions can be associated with particular regions of the data. For example, an action to navigate to a web page might apply to each product, but the action to see detailed stock transfer transactions would apply to each value of Quantity by product and warehouse.
While actions are defined as part of the UDM, it is the responsibility of the client application to retrieve detail of the actions that are applicable, offer them to the user, and then initiate the action as required.
While the previous section focused on the end user model, this section briefly covers the architecture and APIs.
A UDM is defined over one (or more) data sources. These data sources can be of a variety of types, including:
- Relational databases;
- Web Services;
While a specific set of data sources will be supported, other parties can provide cartridges for additional data sources.
Client tools (generic Reporting and OLAP clients as well as custom applications) access the data through the UDM. The client API is the public standard XML/A, a SOAP-based protocol for issuing commands and receiving responses, exposed as a web service. In addition, client object models are provided over XML/A, including both a managed provider (ADO MD.Net) and a native OLE DB provider.
Query commands can be issued in either SQL or MDX (Multi-dimensional Expressions, an industry standard query language orientated towards analysis).
The 'UDM-server' is Microsoft Analysis Services.
Figure 11. The UDM Architecture
As the client interfaces to the UDM are compatible with the existing Analysis Services product in SQL Server 2000, there are many existing client tools that are already enabled to work against the UDM, including:
- Office 10 tools (Excel and Office Web Components);
- Numerous external analysis and reporting tools (such as ProClarity and Crystal Reports).
- SQL Server Reporting Services
In addition, the Microsoft Reporting Services product, due to ship with the next release of SQL Server, will also work against the UDM.
UDMs can be created and managed using the same XML/A interface. There is a corresponding object model AMO (Analysis Management Objects) over XML/A, which provides support for such management operations. The SQL Server release will include a rich set of development tools for constructing a UDM. However, this is primarily in support of database administrators or power-users/analysts constructing rich models. In other scenarios, a light-weight, simple UDM is constructed on the fly over a data source, with no user intervention.
In the case of proactive caching, queries initially sent to the UDM are answered by querying the latest data in the data source. Meanwhile, a cache is built of the data and aggregated data, and upon completion subsequent queries are then answered much more rapidly from the cache. Upon a change to the data, then upon the acceptable 'latency' time being passed, the cache is dropped, queries are again directed to the data source, and a new cache is created.
Access to the UDM can be controlled. This paper provides only a brief overview of this important aspect.
The key features of security are:
- The UDM provides role-based security. Roles (for example, 'Regional analyst') can be defined, permissions granted to the roles, and users included as members of each role. The actual permissions of a user are the union of permissions granted to each role to which the user belongs. Permissions for a role can also define 'strong denies,' removing rights irrespective of other roles to which a user might belong.
- Administrative permissions (for example, to change a UDM) can be granted independently of data access permissions. Also, separate permissions can be defined for:
- reading the metadata of the object;
- access (read/write) to the data.
- Data can be secured at a fine level of granularity, down to individual cells (for example, sales for product 'Widget' to customer 'ACME'). Security can also be conditional (for example, a role might be allowed to see the total salary for a department only if there are more than five employees in that department).
- The permissions can define whether visual totals should be used, in which case totals reflect only the lower level members to which the user has permissions. Cell access can also be contingent. In this case, cells derived from other cells (such as profit derived from income and cost) are viewable only if all the other cells are also viewable.
The technology supporting the UDM will be delivered in SQL Server 2005. This release will include:
- A server product that runs as an NT service and exposes a Web service allowing for the creation, management, and querying of UDMs.
- A DLL that provides many of the same services as the NT service, including the creation of and querying of UDMs. This allows the technology to be embedded within other products. As an example of this, the SQL Server Reporting Services will exploit this functionality to allow a report to be exposed as a data source itself. Hence from the data in the report, a UDM will be dynamically constructed to expose that data to the same wide range of client query and analytics tools that work against the server product.
The APIs are an evolution of those used by the Analysis Services component of SQL Server 2000. As such, many Microsoft and external client tools and applications are already UDM-enabled.