Unified Dimensional Model
A user who wants to retrieve information directly from a data source, such as an Enterprise Resource Planning (ERP) database, faces several significant challenges:
The contents of such data sources are frequently very hard to understand, being designed with systems and developers instead of users in mind.
Information of interest to the user is typically distributed among multiple heterogeneous data sources. Even if dealing only with different relational databases, the user must understand the details of each, such as the dialect of SQL that is used. Worse, those data sources might be of very different types, including not only relational databases but files and Web services.
Whereas many data sources are oriented toward holding large quantities of transaction level detail, frequently the queries that support business decision-making involve summary, aggregated information. With increased data volumes, the time that is required to retrieve such summary values for interactive end-user analysis can be prohibitive.
Business rules are generally not encapsulated in the data sources. Users are left to make their own interpretation of the data.
The role of a Unified Dimensional Model (UDM) is to provide a bridge between the user and the data sources. A UDM is constructed over one or more physical data sources. The user issues queries against the UDM using a variety of client tools, such as Microsoft Excel.
There are advantages to the end user even when the UDM is constructed only as a thin layer over the data source: simpler, more easily understood model of the data, isolation from heterogeneous backend data sources, and improved performance for summary type queries. In some scenarios, a simple UDM can be constructed automatically. Greater investment in the construction of the UDM can generate additional benefits that accrue from the richness of metadata that the model can provide.
The UDM provides the following benefits:
Greatly enriches the user model.
Provides high performance queries supporting interactive analysis, even over large data volumes.
Captures business rules in the model to support richer analysis.
Supports ‘closing the loop’: letting users can act upon the data they see.
Consider an example where a user wants to compare sales with quotas for different time periods.
The sales data is stored in the main Sales and Inventory database, which contains many other tables. Even after identifying the relevant tables, the user may find that data for a single entity, such as Product, is distributed over many tables. Because referential integrity is enforced by the application logic, no relationships are defined between those tables. The Sales Quotas are stored in the database of another application. Neither database captures any business rules, such as the fact that to compare quotas with actual sales, the date that the order shipped must be used instead of the many other dates for orders (date ordered, date due, data scheduled, and so on).
Accessing the Data Sources Directly
First, consider the case where the user accesses the data sources directly. The following illustration shows an example of a query being constructed using a sample tool.
By this point, the user has made significant progress. This progress includes:
Sifting through the large number of cryptically named tables to find the ones of interest.
Identifying which columns should be used to join the tables together.
Selecting those columns that contain the details of interest, from many tables with much system-oriented detail. For example, among the 11 columns in the tables that store details about product categories, only the two name columns are actually relevant to the user.
The user is now engaged in defining where ‘outer’ versus ‘inner’ joins should be used, and how to group details to provide the required aggregates.
However, the user faces more difficult tasks. For example, how can the user join data from the other data source? Even if one of the databases supported distributed queries, most users cannot construct the required query, and tools may provide insufficient support to the user in this task. The code sample shows one method of querying the external data.
SELECT Quotas.QuotaAmount, Quotas.EmployeeId, … FROM OPENROWSET('SQLOLEDB','seattle1'; 'Sales';'MyPass', 'SELECT * FROM Forecasts.dbo.SalesQuota’ ) As Quotas
When other data sources are used, such as Web services, the user is presented with another big hurdle in determining how to make the correct remote calls, and then how to process the returned XML to combine it with the other data.
Finally, after this work is performed for one query, much of it must be redone for the next query, and every succeeding query.
Accessing the Data Sources by using a UDM
By contrast, the following diagram shows an example of how building a query appears to a user accessing a simple UDM constructed over these data sources.
The design interface shown in this example is available in the development tools provided with Microsoft SQL Server 2005. However, any interface that supports the UDM could be used, including client tools such as Office Excel or Office Web Components (OWC), or one of the many reporting and analysis tools.
The tree view on the left presents the contents of the UDM. Note the following points in this example:
Only user-oriented, relevant items are exposed to the user. System columns such as row guids, or date last modified, are not visible.
The names used are friendly names, instead of the developer-oriented naming conventions employed in the underlying database.
The UDM also groups all attributes for each business entity into separate ‘dimensions’, such as Product, or Employee. The client could therefore refer to the Product Color, Subcategory, and Category in this example without explicitly performing joins between the many tables involved.
Columns that represent transaction values, or measurements are then presented as ‘measures’. For example, users are typically interested in aggregating columns such as the sales amount or sales quota. This method of presenting the data as ‘measures’ and ‘dimensions’ is called Dimensional Modeling.
The right side of the diagram shows the elements that are included in the current query. In this case, to request “Sales Amount and Quota by Product Category,” the user defined the query simply by dragging the three relevant items from the tree view into the right side of the design interface. The user did not have to specify the details that are required to actually access the two different data sources, or perform the correct joins between the many tables involved.
The model defines the simple default formatting: for example, the use of currency symbols. Richer formatting can also be defined, including conditional formatting, such as displaying a value in red if the value is below a certain threshold.
The same model supports various queries. For example, results can be broken down by employee just by dragging in an attribute from the Employee dimension.
The previous example demonstrates how even a simple UDM can greatly simplify the basic exploration of data. However, there are other challenges to consider when providing users access to data. For example:
A UDM that supports many different types of queries from different users might grow to considerable size. How can we ensure that a user who is working on a particular task is not overwhelmed with irrelevant information?
How do we support the requirements of global users, who want to see reports in their native languages?
How do we make it easy to ask all the common questions about time? For example, a user might want to show sales compared with the same period of the previous year.
This section examines some of these questions to show how the UDM supports extending the basic model to enable more advanced data exploration.
Although the consolidation of all the attributes of an entity into a dimension greatly simplifies the model for the user, there are additional relationships between the attributes that a simple list cannot express. In the previous case, Category, SubCategory and SKU define one of the hierarchies in which products can be organized. The UDM lets you define such hierarchies, because users frequently want to perform analysis based on such hierarchies. For example, after viewing the totals by Category, the user might drill down to SubCategory, and then drill down to the lowest SKU level,. Each hierarchy is a sequence of attributes that can be used in queries to ease such drill-down/drill-up scenarios.
The following diagram is an example of how hierarchies might appear in an interface shown to the end user. The model contains several different hierarchies by which products can be organized. The query that is shown here answers this question: “show sales and quotas grouped by product category, and then broken down into subcategory.” The query was defined by dragging the “Products By Category” hierarchy into the grid. To view the detailed data, the user double-clicks the “Bike” category to expand the subcategories.
The UDM handles the details of how to move between levels of a hierarchy. The UDM also handles such details as the fact that Quotas are not available at the SubCategory level, but only at the Category level.
One special kind of hierarchy is a parent-child hierarchy, covering entities that have an involuted relationship to themselves. In the next illustration, the Employee dimension has a hierarchy named “Employees By Organization Structure”. Use of this hierarchy makes it easy to navigate the parent-child relationship and analyze the rolled up values at each level of the organization. For example, the sales quota for the VP of Sales, Charles Marshall, includes the sum of the sales quotas of all his staff, plus any sales quotas associated directly with him.
Users naturally apply categorizations to their data. For example, a user might say “these attributes are all about an employees personal details" or "this attribute is an e-mail address.” The UDM provides two mechanisms specifically aimed at providing additional value based on such categorizations:
Dimensions, attributes, and other objects can be put into semantically meaningful categories, enabling the object to be used more intelligently by a client tool. For example, an attribute can be marked as being a URL. The report that contains this attribute could then enable navigation based on the values of the URL. Another attribute might be marked as being an e-mail address. In this case, a reporting client might automatically open a new e-mail upon some user action.
Measures, hierarchies, and other objects can be grouped into folders that are meaningful to the user. This grouping lets the reporting tool display large numbers of attributes in a more manageable way. For example, there might be a group of attributes labeled as ‘Customer Demographics’.
Time information is generally recorded in the underlying data source by using either DateTime or Date data types. Although users who are proficient in SQL or XPath can extract the date information that is required to total data by year, even they might find it difficult to compose a query for questions based on other aspects of time, such as “Show sales by day of week” or “Show a breakdown by fiscal year, starting on July 1.”
However, the UDM has a built in knowledge of time, which includes the following calendars:
Reporting (‘445’ etc.)
Manufacturing (13 periods)
Therefore, the model can include a time dimension that provides a rich set of attributes defining details of each day. The following illustration shows the results when the user elected to see the sales amount and quotas for fiscal year 2001. To do this, the user simply dragged the relevant item from the tree onto the filter area. The UDM knows how to translate that user action into a range of dates, and additionally understands the business rule that says that orders shipped on those dates must be included in the query, not the orders due or ordered on those dates. The correct join is implicitly made by the UDM.
Moreover, the UDM provides specific support for answering common questions related to time, including period-to-period comparisons such as “compare this month with the same month last year.”
In the previous examples, both the model contents and the data are displayed in a single language.. However, international users frequently have a need to view metadata in their local language.
To address this, the UDM allows translations of metadata to be provided in any language. A client application that connects using a particular locale would receive all metadata in the appropriate language.
The model can also provide translations of data. An attribute can map to different elements in the data source, and provide the translations for those elements in different languages. For example, if the user connects by using the same tool that we have been using for the previous examples, but from a client computer that has a French locale, both the UDM and the query results would be displayed in French, as shown in the illustration.
Although the example model used here is of very modest size, real-world models might have a much wider scope, including tens of measures and dimensions, with each dimension including tens or hundreds of attributes. Users engaged on a particular task generally do not have to see the complete model. To avoid overwhelming users with the sheer size of the model, we need the ability to define a view that shows a subset of the model.
The UDM provides such views, called perspectives. A UDM can have many perspectives, each one presenting only a specific subset of the model (measures, dimensions, attributes, and so on) that is relevant to a particular group of users. Each perspective can then be associated with the user security roles that define the users who are permitted to see that perspective.
For example, a perspective named Seattle Inventory could be defined that includes only measures from the Inventory measure group, hides the “Warehouse By Location” hierarchy, and makes the default City be “Seattle”.
A UDM provides additional semantics for attributes. These semantics are aimed at making the information more easily consumable. The following are some examples of semantics that can be applied to attributes:
Names vs. Keys: Looking at the relational database, it might not be apparent that EmployeeID is a meaningless, unique, system-generated key. The UDM resolves this issue by letting the Employee attribute have both a key (the unique EmployeeID), and a name (for example, a concatenation of FirstName and LastName). A query such as “show the employees” will correctly distinguish employees who have the same name, by using their unique IDs, but will display the meaningful name to the user.
Ordering: The values of attributes frequently must be displayed in some fixed order that is not a simple alphabetical or numeric order. The UDM lets you define a default ordering to manage this requirement. For example:
The days of the week appear as Sunday, Monday, Tuesday, and so on.
Priorities are displayed in the order High, Medium, and Low.
- The days of the week appear as Sunday, Monday, Tuesday, and so on.
Discretization: For numeric attributes, sometimes it is not useful to display each distinct value of an attribute. For example, seeing the sales for all the different prices for a product ($9.97, $10.05, $10.10,…) is much less useful than seeing the sales per price range ( <$10, $10 - $15,…). The UDM lets you discretize attributes into such ranges by using various criteria.
Key Performance Indicators (KPIs)
Businesses frequently define Key Performance Indicators (KPIs), which are important metrics used to measure the health of the business. The UDM allows such KPIs to be created, so that businesses can group and present data in a way that is easier to understand. A KPI can also use a graphic to display the status and trend, such as a traffic light to indicate good, average, or bad.
Each KPI in the UDM defines up to four expressions for each performance metric:
Status A normalized value between -1 and 1 that represents the ratio of actual vs. goal (-1 is ‘very bad’, 1 is ‘very good’)
Trend A normalized value between -1 and 1 that represents the trend over time (-1 is ‘getting much worse’, 1 is ‘getting much better’)
The use of KPIs lets client tools present related measures in a way that is much more immediately understood by the user. The following figure shows an example of how three KPIs, organized into display folders, might be displayed by a client tool.
Interactive exploration by users requires fast response times. This requirement presents challenges given the very large data sets over which such exploration is frequently conducted.
To improve performance, the UDM provides caching services. Caches can store both the detailed data read from the underlying data source, and pre-calculated aggregate values based on that data. However, use of such cached values might imply some degree of staleness of the data. Business requirements dictate how current information must be. In some cases it may be critical to show the latest data, whereas in other cases, it might be perfectly acceptable to show data that is two hours, or two days, old.
To reflect these policies dictating the currency of the data, the UDM allows the cache to be either explicitly managed (for example, a schedule could be defined to refresh the cache daily at 2 a.m.) or transparently managed by using proactive caching. The user can specify how up-to-date the data must be, and the UDM will provide automatic cache creation and management to enable the fastest possible query response.
The previous sections addressed how the UDM can support interactive exploration of data. However, just making information available from the underlying data sources, even if in a much more easily understood and usable form, clearly does not meet the goal of incorporating business logic into the user's model. Therefore, the UDM provides the ability to define both simple and complex calculations over the data.
Queries generally return aggregated data. For example, a typical query shows sales by category, instead of showing each and every sales order. However, there is nothing in the underlying relational data that defines how a particular measure should be aggregated. For example, sales amount can sensibly be summed, but unit price should be averaged. The UDM adds this semantic.
The method of aggregation can be defined by using a variety of schemes:
A simple aggregation function, such as Sum, Count, Distinct Count, Max, or Min can be used.
The aggregation can be defined as being semi-additive. This means that a simple function such as Sum is used for all dimensions except Time, where Last Period is used. For example, although the Inventory Level can be summed from Product to Product Category, the inventory level for the month is not the sum of the inventory levels for each day; instead, the inventory level for the month is the inventory level of the last day in the month.
The aggregation can be based on the type of account, such as Income vs. Expense.
The aggregation can be customized to fulfill any special requirements.
A UDM can also contain calculated members. These members have no direct association with the source data but are instead derived from that data. For example, a calculated member, Variance, can be defined to calculate the difference between Sales and Quota.
Similarly, a UDM can define sets of entities of interest to the user: for example, the top ten customers by volume of sales, or the most important products. These sets can then easily be used to restrict the scope of a query to a particular set of entities.
Sometimes the calculations required by users are far more complex than the ‘Variance’ example given earlier. The following are some examples of complex calculations:
Show the moving three-month average for each time period.
Compare year-on-year growth for this period and the same period last year.
If sales are reported in the base currency, convert sales back to the original currency, using the daily average exchange rate at the time of sale.
Calculate the budgeted sales per category for next year as a ten percent increase over this year, and then allocate budget down to each product based on relative average sales over the last three years.
The UDM provides a rich model for defining such calculations, and is like a multi-dimensional spreadsheet, where the value of a cell can be calculated based on the values in other cells. However, even this metaphor does not adequately describe the richness of calculations in the UDM. A cell might have its value calculated not just on what the value of another cell is, but also what the value in that cell used to be. Therefore, simultaneous equations can be supported; for example, profit is derived from revenue minus expense, but bonuses, which are included in the expenses, are derived from profit.
In addition to providing the powerful Multi Dimensional Expressions (MDX) language, which is designed specifically for authoring such calculations, the UDM also enables integration with Microsoft .NET. This integration lets stored procedures and functions be written in any verifiable .NET language, such as C#.NET or Visual Basic .NET. The stored procedure or function can then be invoked from MDX for use in calculations.
The client is isolated from the details of such calculations. To client applications, it only appears that now the model has more useful measures. In the example that follows, the user is viewing various calculated measures, based on Sales, for the most profitable products sold in the United States.
Integration with Data Mining
The ability to show data in a rich, readily understood form is valuable, but users also need the ability to infer new information from that data.
The UDM is tightly integrated with Data Mining technology, to allow data to be mined and later to use the discovered patterns for prediction.
For a user, seeing data frequently leads immediately to additional questions, or to the desire to take some action. For example:
“What are the detailed sales that contributed to that number?”
“The quota is too low – I have to increase it.”
“That looks odd – I want to mark that number with a comment.”
“What details for that promotion do we have on our Web site?”
It is not enough to present data to users in an easily understood manner. It is also necessary to make it easier for them to take action based upon the data they see.
The UDM supports this in two ways:
Allowing changes to be written back into the data
Enabling actions to be associated with the data
The UDM is not read-only. The data can also be updated through the UDM. In the case of measures, the updates can be stored separately from the original values, as deltas to those values.
In addition, it is possible to update summary numbers. For example, consider a Budgeting scenario. The budgeted amount might eventually be known down to a detailed level (by team and account), but initially values are known only at a more summarized level (by department and account type).
The UDM supports actions as a link between the data and an action taken based on the data. The main kinds of actions are:
URL: Go to a specified URL. This type of action supports both directing the user to some URL to obtain further information, and directing the user to some Web-based application that allows a new task to be performed. For example:
For a product, go to the company website describing that product.
For a product/warehouse combination, go to the Web-based inventory management application, passing the product/warehouse as parameters, to allow the safety stock level to be increased.
- For a product, go to the company website describing that product.
Reporting: Execute a specified report. For example, for a given product, the action could execute a parameterized product report that provides the product description and the current order status.
DrillThrough: Drill through to the lowest level of detail available. For example, a user examining total sales by product and customer could drill through to view all sales transactions contributing to the total.
The actions can be associated with particular regions of the data. For example, an action to navigate to a Web page might apply to each product, but the action to see detailed stock transfer transactions would apply to each value of Quantity by product and warehouse.
While actions are defined as part of the UDM, it is the responsibility of the client application to retrieve the details of the actions that are applicable, offer the actions to the user, and then initiate the action as required.
Access to the UDM can be controlled. The key features of security are as follows:
The UDM provides role-based security. Roles can be defined, permissions granted to the roles, and users included as members of each role. The actual permissions of a user are the union of permissions granted to each role to which the user belongs. Permissions for a role can also include ‘strong denies’, which remove rights regardless of other roles to which a user might belong.
Administrative permissions (for example, to change a UDM) can be granted independently of data access permissions. Also, separate permissions can be defined for reading the metadata of the object, and for read/write access to the data.
Data can be secured at levels of granularity down to individual cells. For example, you can limit the ability of users to view the sales of the product ‘Widget’ to customer ‘ACME’. Security can also be conditional: for example, a role might be allowed to see the total salary for a department only if there are more than five employees in that department.
The permissions can define whether visual totals should be used, in which case totals reflect only the lower level members to which the user has permissions. Cell access can also be contingent, which means that cells derived from other cells are viewable only if all the other cells are also viewable. For example, if profit is derived from income and cost, then users can view profits only for products for which they have permission to view both income and cost.