Rebeca M. Riordan
Chapter 5 from Designing Relational Database Systems, published by Microsoft Press
In the previous chapters, we have looked at defining a particular kind of relation, called a base relation, that will be given a physical representation in the database. The relational model also supports the creation of several kinds of derived relations. A derived relation is a relation that is defined in terms of other relations rather than in terms of attributes. These named relations can be base relations or other derived relations, in any combination.
In the database schema, a base relation is represented by a table. Derived relations are represented by views in Microsoft SQL Server and queries in the Microsoft Jet database engine. For the sake of linguistic simplicity, I'll use the term "view," since it's the standard relational term. I'll also use the term "recordset" generically when I mean either a view or a query.
Views are defined in terms of the relational operations that are the subject of this chapter. Microsoft Access and the SQL Server Enterprise Manager provide a graphical interface for defining views. They can also be defined in terms of SQL SELECT statements.
SQL (usually pronounced "sequel") stands for Structured Query Language. It is a standard language for expressing relational operations. Both the Jet database engine and SQL Server support a dialect of SQL. Not, of course, the same dialect. That would be too easy. Fortunately, the differences between the two implementations don't often affect the relational algebra that we'll be discussing in this chapter. Where the syntax differs, I'll give examples from both dialects.
The SQL SELECT statement is extremely powerful and more than a little complex. A detailed examination of it is outside the scope of this book. For our purposes, we can restrict ourselves to the basic structure, which has the following syntax:
SELECTV<fieldList> FROM <recordsetList> <joinType> JOIN <joinCondition> WHERE <selectionCriteria> GROUP BY <groupByFieldList> HAVING <selectionCriteria> ORDER BY <orderByFieldList>
The <fieldList> in the SELECT clause is a list of one or more fields to be included in the recordset resulting from the statement. The fields can be actually present in the underlying recordsets, or they can be calculated. The <recordsetList> in the FROM clause is, as one might expect, a list of tables and views on which the SELECT statement is based. These are the only two clauses of the SELECT statement that must be included; all others are optional.
The JOIN clause defines the relationship between the recordsets listed in <recordsetList>. We'll be looking at joins in detail later in this chapter. The WHERE clause defines a logical expression, <selectionCriteria>, which restricts the data to be included in the resulting recordset. We'll look at restriction in detail later, as well.
The GROUP BY clause combines records having the same values in the specified list of fields to be combined into a single record. The HAVING clause is used to further restrict the fields returned after they've been combined by the GROUP BY clause. Finally, the ORDER BY clause causes the recordset to be sorted according to the fields listed in <orderByFieldList>.
On This Page
Nulls and Three-Valued Logic (One More Time)
Most of the operations of relational algebra involve the use of logical operators, operators that usually return a Boolean result—that is, True or False. I say "usually" because with the addition of nulls to the relational model things get a little more complicated.
Nulls add a third value to the set of Boolean values; you must then work with True, False, and Null. Not surprisingly, these operators become known as three-valued logic. The three-valued truth tables for the standard logical operators are shown in Figure 5-1.
As you can see, Null op anything, where op is a logical operator, results in Null. This is generally also true of the logical comparison operators, as shown in Figure 5-2.
SQL Server, for reasons that I'm sure make sense to its designers, adds an "extension" to normal logical operations. If the option ANSI_NULLS is turned off, Null = Null evaluates to True, and Null = <value>, where <value> is anything except Null, evaluates to False. (This is undoubtedly related to the issue of allowing only a single Null value in UNIQUE indices.)
SQL provides two unary operators—IS NULL and IS NOT NULL—to specifically handle Null values. They work exactly as one might expect. The truth tables for IS NULL and IS NOT NULL are shown in Figure 5-3. Again, <value> indicates anything except Null.
We'll begin our examination of relational algebra with the four types of relational operators: restriction, projection, join, and divide. The first two affect a single recordset, although that recordset can, of course, be a view based on any number of other recordsets. The join operator is perhaps the most fundamental to the relational model and defines how two recordsets are to be combined. The final operator, divide, is a rarely used but occasionally handy method of determining which records in one recordset match all the records in a second recordset.
All of these operators are implemented with some form of the SQL SELECT statement. They can be combined in any way you want, subject to the system constraints regarding maximum length and complexity for the statement.
The restriction operator returns only those records that meet the specified selection criteria. It is implemented using the WHERE clause of the SELECT statement, as follows:
SELECT * FROM Employees WHERE LastName = "Davolio";
In the Northwind database, this statement returns Nancy Davolio's employee record, since she's the only person in the table with that last name. (The * in the <fieldList> section of the statement is special shorthand for "all fields.")
The selection criteria specified in the WHERE clause can be of arbitrary complexity. Logical expressions can be combined with AND and OR. The expression will be evaluated for each record in the recordset, and if it returns True, that record will be included in the result. If the expression returns either False or Null for the record, it will not be included.
While restriction takes a horizontal slice of a recordset, projection takes a vertical slice; it returns only a subset of the fields in the original recordset.
SQL performs this simple operation using the <fieldList> section of the SELECT statement by only including the fields that you list. For example, you could use the following statement to create an employee phone list:
SELECT LastName, FirstName, Extension FROM Employees ORDER BY LastName, FirstName;
Remember that the ORDER BY clause does just that, it sorts the data; in this case, the list will be sorted alphabetically by the LastName field and then by the FirstName field.
Join operations are probably the most common relational operations. Certainly they are fundamental to the model—it would not be feasible to decompose data into multiple relations were it not possible to recombine it as necessary. This is precisely what a join operator does; it combines recordsets based on the comparison of one or more common fields.
Joins are implemented using the JOIN clause of the SELECT statement. They are categorized based on the type of comparison between the fields involved and the way the results of the comparison are treated. We'll look at each of these in turn.
When the join comparison is made on the basis of equality, the join is an equi-join. In an equi-join operation, only those records that have matching values in the specified fields will be returned.
Take, for example, the relations in Figure 5-4. This is a typical case of linked tables resulting from the normalization process. OrderID is the primary key of the Orders table and a foreign key in the Order Details table.
To recombine (and consequently denormalize) the tables, you would use the following SELECT statement:
SELECT Orders.OrderID, Orders.CustomerID, [Order Details].ProductID FROM Orders INNER JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID WHERE (((Orders.OrderID)=10248));
This statement would result in the recordset shown in Figure 5-5.
Note: If you run this query in Access 2000 using the Northwind database, the result set will show the customer name rather than the CustomerID. This is because Access allows fields to display something other than what's actually stored in them when you declare a lookup control in the table definition. This is a real advantage when Access is used interactively, but wreaks havoc on authors trying to provide examples.
A special case of the equi-join is the natural join. To qualify as a natural join, a join operation must meet the following conditions:
The comparison operator must be equality.
All common fields must participate in the join.
Only one set of common fields must be included in the resulting recordset.
There's nothing intrinsically magical about natural joins. They don't behave in a special manner, nor does the database engine provide special support for them. They are merely the most common form of join, so common that they've been given their own name.
The Jet database engine does do something particularly magical if you create a natural join that meets certain special conditions. If a one-to-many relationship has been established between the two tables, and the common fields included in the view are from the many side of the relationship, then the Jet database engine will perform something called Row Fix-Up or AutoLookup. When the cursor enters the fields used in the join criteria, the Jet database engine will automatically provide the common fields, a spectacular bit of sleight of hand that makes the programmer's life much simpler.
Technically, all joins are theta-joins, but by custom, if the comparison operator is equality, the join is always referred to as an equi-join or just as a join. A join based on any other comparison operator (<>, >, >=, <, <=) is a theta-join.
Theta-joins are extremely rare in practice, but they can be handy in solving certain kinds of problems. These problems mostly involve finding records that have a value greater than an average or total, or records that fall within a certain range.
Let's say, for example, that you've created two views, one containing the average number of units sold for each product category and a second containing the total units sold by product, as shown in Figure 5-6. We'll look at how to create these views later in this chapter. For now, just assume their existence.
The following SELECT statement, based on the comparison operator >, will produce a list of the best-selling products within each category:
SELECT DISTINCTROW ProductCategoryAverages.CategoryName, ProductTotals.ProductName FROM ProductCategoryAverages INNER JOIN ProductTotals ON ProductCategoryAverages.CategoryID = ProductTotals.CategoryID AND ProductTotals.TotalSold > [ProductCategoryAverages].[AverageSold];
The results are shown in Figure 5-7.
In this example, the view could also have been defined using a WHERE clause restriction. In fact, Access will rewrite the query when you leave SQL view to look like the following:
SELECT DISTINCTROW ProductCategoryAverages.CategoryName, ProductTotals.ProductName FROM ProductCategoryAverages INNER JOIN ProductTotals ON ProductCategoryAverages.CategoryID = ProductTotals.CategoryID WHERE (((ProductTotals.TotalSold)>[ProductCategoryAverages].[AverageSold]));
Technically, all joins, including equi-joins and natural joins, can be expressed using a restriction. (In database parlance, a theta-join is not an atomic operator.) In the case of theta-joins, this formulation is almost always to be preferred since the database engines are better able to optimize its execution.
All of the joins we've examined so far have been inner joins, joins that return only those records where the join condition evaluates as True. Note that this isn't exactly the same as returning only the records where the specified fields match, although this is how an inner join is usually described. "Match" implies equality, and as we know not all joins are based on equality.
Relational algebra also supports another kind of join, the outer join. An outer join returns all the records returned by an inner join, plus all the records from either or both of the other recordsets. The missing ("unmatched") values will be Null.
Outer joins are categorized as being left, right, or full, depending on which additional records are to be included. Now, when I was in grad school, a left outer join returned all the records from the recordset on the one side of a one-to-many relationship, while a right outer join returned all the records from the many side. For both the Jet database engine and SQL Server, however, the distinction is based on the order in which the recordsets are listed in the SELECT statement. Thus the following two statements both return all the records from X and only those records from Y where the <condition> evaluates to True:
SELECT * FROM X LEFT OUTER JOIN Y ON <condition> SELECT * FROM Y RIGHT OUTER JOIN X ON <condition>
A full outer join returns all records from both recordsets, combining those where the condition evaluates as True. SQL Server supports full outer joins with the FULL OUTER JOIN condition:
SELECT * FROM X FULL OUTER JOIN Y ON <condition>
The Jet database engine does not directly support full outer joins, but performing a union of a left outer join and a right outer join can duplicate them. We'll discuss unions in the next section.
The final relational operation is division. The relational divide operator (so called to distinguish it from mathematical division) returns the records in one recordset that have values that match all the corresponding values in the second recordset. For example, given a recordset that shows the categories of products purchased from each supplier, a relational division will produce a list of the suppliers that provide products in all categories.
This is not an uncommon situation, but the solution is not straightforward since the SQL SELECT statement does not directly support relational division. There are numerous ways to achieve the same results as a relational division, however. The easiest method is to rephrase the request.
Instead of "list the suppliers who provide all product categories," which is difficult to process, try "list all suppliers where the count of their product categories is equal to the count of all product categories." This is an example of the extension operation that we'll discuss later in this chapter. It won't always work, and in situations where it doesn't, you can implement division using correlated queries. Correlated queries are, however, outside the scope of this book. Please refer to one of the references listed in the bibliography.
The next four operators of relational algebra are based on traditional set theory. They have, however, been modified slightly to account for the fact that we're dealing with relations, not undifferentiated sets.
Conceptually, a relational union is the concatenation of two recordsets. It's more or less the relational version of addition. The result of the union of recordset A with recordset B is the same as actually adding all the records in A to B.
As an example, say you need a list of all the names and addresses known to the database for a mass mailing. The Northwind database Customers and Employees recordsets both have addresses and so can easily be combined in a union operation. In this case, we'd use the UNION statement, as follows:
SELECT CompanyName AS Name, Address, City, PostalCode FROM Customers UNION SELECT [FirstName] & " " & [LastName] AS Name, Address, City, PostalCode FROM Employees ORDER BY name;
Note that the CompanyName field is renamed "Name" and the FirstName and LastName fields from the Employees table are concatenated. The resulting field is also "Name." The union query doesn't require that the fields in the <fieldList> of each SELECT statement have the same name, but there must be the same number of them and they must have the same (or compatible) types. The results of this statement in Access are shown in Figure 5-8.
The intersection operator returns the records that two recordsets have in common. It is, in essence, a "find the duplicates" operation, and that's the way it's most often used. An intersection is implemented using outer joins.
As an example, suppose that you have inherited client lists from several legacy systems, as shown in Figure 5-9.
The following SELECT statement will return the duplicate records:
SELECT DuplicateCustomers1.* FROM DuplicateCustomers1 LEFT JOIN DuplicateCustomers2 ON (DuplicateCustomers1.CustomerID = DuplicateCustomers2.CustomerID) AND (DuplicateCustomers1.CompanyName = DuplicateCustomers2.CompanyName) WHERE (((DuplicateCustomers2.CustomerID) IS NOT NULL));
The results of this statement are shown in Figure 5-10.
While the intersection of two recordsets is used to "find the duplicates," the difference operator will "find the orphans." The relational difference of two recordsets is the records that belong to one recordset but not the other.
As an example, given the same two recordsets shown in Figure 5-9 on the preceding page, the SELECT statement below will return the unmatched records:
SELECT DuplicateCustomers1.* FROM DuplicateCustomers1 LEFT JOIN DuplicateCustomers2 ON (DuplicateCustomers1.CustomerID = DuplicateCustomers2.CustomerID) AND (DuplicateCustomers1.CompanyName = DuplicateCustomers2.CompanyName) WHERE (DuplicateCustomers2.CustomerID IS NULL);
The outer join operation in this statement returns all the records from the two lists. As you will recall, an outer join supplies Null for the fields that do not have a match in the other table. The WHERE clause uses the IS NULL operator to restrict the records returned to only those (unmatched) records.
If this all seems as clear as mud, try performing the operation in two discrete steps: first create the outer join as a view, and then restrict the view with the WHERE statement. This process is shown in Figure 5-11.
The final set operator is the Cartesian product. Like its counterpart in traditional set theory, the Cartesian product of two recordsets combines every record in one set with every record in the other.
The Cartesian product (or just "product") of two recordsets is returned by a SELECT statement with no JOIN clause. The statement below will return every customer combined with every customer service representative:
SELECT CustomerName, CSRName FROM Customer, CSRs;
Cartesian products are occasionally useful either for analysis purposes or as interim results for further manipulation. Most often, though, they're produced by accident. Forget to drag the join line in the Access query designer and bingo, you've got a Cartesian product. It's amazingly easy to do, so don't be embarrassed the first (dozen) times it happens to you.
Special Relational Operators
Various extensions to relational algebra have been proposed since the relational model was first formulated. We'll look at three that have been generally accepted: summarize, extend, and rename. We'll also look at three extensions provided by Microsoft: transform, rollup, and cube.
The summarize operator does precisely what one would expect it to do: it produces records containing summary data grouped according to the specified fields. It's an extremely useful operation in any number of situations in which you want to examine data at a higher level of abstraction than is stored in the database.
The summarize operation is implemented using the GROUP BY clause of the SELECT statement. There will be one record returned for each distinct value in the specified field or fields. If more than one field is listed, groups will be nested. For example, consider the following statement:
SELECT Categories.CategoryName, Products.ProductName, SUM([Order Details].Quantity) AS SumOfQuantity FROM (Categories INNER JOIN Products ON Categories.CategoryID = Products.CategoryID) INNER JOIN [Order Details] ON Products.ProductID = [Order Details].ProductID GROUP BY Categories.CategoryName, Products.ProductName;
This statement will return one record for each product in the Northwind database, grouped by category and containing three fields: CategoryName, ProductName, and SumOfQuantity—the total number of each product sold—as shown in Figure 5-12.
The fields listed in <fieldList> in the SELECT statement must be either part of the <groupFieldList> or an argument to a SQL aggregate function. SQL aggregate functions calculate summary values for each record. The most common aggregate functions are AVERAGE, COUNT, SUM, MAXIMUM, and MINIMUM.
Aggregates are another place where nulls can bite you. Null values are included in the summarize operation—they form a group. They are, however, ignored by aggregate functions. This is usually only a problem if you're using one of the fields in the <groupFieldList> as the parameter to an aggregate function.
The extend operator allows you to define virtual fields that are calculated based on constants and values stored in the database but that are not physically stored themselves. You create virtual fields simply by defining the virtual field in the <fieldList> of the SELECT statement, as follows:
SELECT [UnitPrice]*[Qty] AS ExtendedPrice FROM [Order Details];
The calculations defining the virtual fields can be of arbitrary complexity. This process is so simple and fast, there is rarely any justification for storing a calculated field in a table.
The final common operator is rename. The rename operation can be performed on either a recordset in <recordsetList> or on individual fields in <fieldList>. In the Jet database engine, a recordset renaming uses the following syntax:
SELECT <fieldName> AS <fieldAlias> FROM <tableName> AS <tableAlias>
In SQL Server, the "AS" keyword is not necessary, as shown below:
SELECT <fieldName> <fieldAlias> FROM <recordsetName> <recordsetAlias>
Renaming is particularly useful when you're defining a view with a self-join, as shown in the following code:
SELECT Manager.Name, Employee.Name FROM Employees AS Employee INNER JOIN Employees AS Manager ON Employee.EmployeeID = Manager.EmployeeID;
This syntax allows you to keep each usage logically distinct.
The TRANSFORM statement is the first of the Microsoft extensions to the relational algebra that we'll examine. TRANSFORM takes the results of a summarize (GROUP BY) operation and rotates them 90 degrees. More often referred to as a crosstab query, this incredibly useful operation is only supported by the Jet database engine; it has not (yet) been implemented in SQL Server.
The TRANSFORM statement has the following basic syntax:
TRANSFORM <aggregateFunction> SELECT <fieldList> FROM <recordsetList> GROUP BY <groupByList> PIVOT <columnHeading> [IN (<valueList>)]
The TRANSFORM <aggregateFunction> clause defines the summary data that will populate the recordset. The SELECT statement must include a GROUP BY clause and cannot include a HAVING clause. As with any GROUP BY clause, the <groupByList> can contain multiple fields. (In a TRANSFORM statement, the <fieldList> and <groupByList> expressions are almost always identical.)
The PIVOT clause identifies the field whose values will be used as column headings. By default, the Jet database engine will include the columns in the recordset alphabetically from left to right. The optional IN statement, however, allows you to specify column names, which will be listed in the order in which they're included in <valueList>.
The TRANSFORM statement on the following page provides essentially the same information as the summarize example given previously, the results of which are shown in Figure 5-12 on the preceding page.
TRANSFORM Count(Products.ProductID) AS CountOfProductID SELECT Suppliers.CompanyName FROM Suppliers INNER JOIN (Categories INNER JOIN Products ON Categories.CategoryID = Products.CategoryID) ON Suppliers.SupplierID = Products.SupplierID GROUP BY Suppliers.CompanyName PIVOT Categories.CategoryName;
The results of this TRANSFORM operation are shown in Figure 5-13.
The summarize operator implemented using the GROUP BY clause generates records containing summary data. The ROLLUP clause provides a logical extension to this operation by providing total values.
The ROLLUP clause is only available in SQL Server. It is implemented as an extension to the GROUP BY clause:
SELECT Categories.CategoryName, Products.ProductName, SUM([Order Details].Quantity) AS SumOfQuantity FROM (Categories INNER JOIN Products ON Categories.CategoryID = Products.CategoryID) INNER JOIN [Order Details] ON Products.ProductID = [Order Details].ProductID GROUP BY Categories.CategoryName, Products.ProductName WITH ROLLUP;
This results in the recordset are shown in Figure 5-14.
This is again the same recordset shown in Figure 5-12 on page 92, with additional rows: the rows containing Null (one is shown in the figure) contain the total values for the group or subgroup. Thus, 8,137 beverages were sold in total.
The CUBE operator is also available only in SQL Server and is implemented as an extension to the GROUP BY clause. Essentially, the CUBE clause summarizes every column in the <groupByList> by every other column. It is conceptually similar to the ROLLUP operator, but whereas ROLLUP produces totals for each column specified in the <groupByList>, CUBE creates summary data for additional groups.
For example, if you have three fields in the <groupByList>—A, B, and C—the CUBE operator will return the following seven aggregates:
The total number of Cs.
The total number of Cs, grouped by A.
The total number of Cs, grouped by C within A.
The total number of Cs, grouped by B within A.
The total number of Cs, grouped by B.
The total number of Cs, grouped by A within B.
The total number of Cs, grouped by C within B.
In this chapter, we've looked at manipulating base relations using various relational operators and seen some examples implemented in the SQL language. We've also had yet another look at the issue of nulls and three-valued logic.
Of the standard relational operators, the restriction and projection operators select subsets of a single recordset. The join, union, intersection, difference, and product operators control the way two recordsets are combined. All of these operators, with the exception of difference, can be implemented using a SQL SELECT statement. Difference can sometimes be implemented using SELECT and sometimes requires other techniques that are outside the scope of this book.
We've also looked at a few special operators. The summarize and extend operators perform calculations on the data. The rename operator controls the column headings that are displayed for the view. TRANSFORM, ROLLUP, and CUBE are special extensions to the SQL language implemented by Microsoft, each of which also provides a special way of summarizing and viewing data.
With this overview of relational algebra, we've completed Part I. Relational database theory is complex, and of course, there are issues and subtleties that I haven't been able to cover in the scope of an introductory book. But you've now seen all the major components of the theory. In the rest of the book, we'll turn to the practical aspects of designing database systems and user interfaces.
About the Author
With 17 years of experience in software design, Rebecca M. Riordan has earned an international reputation as a consultant, systems analyst, and designer of database and work-support systems. Her work experience includes leading small "strike-team" development projects in the United States and Australia, managing the professional services department of one of Australia's largest computer retailers, and serving as a senior technical support engineer for Microsoft Australia. She has presented sessions at Tech Ed conferences on topics such as reusable system components and searchable data stores, and at the Australian Computer Society on topics such as designing for usability and usability testing.
Copyright © 1999 by Rebecca Riordan
We at Microsoft Corporation hope that the information in this work is valuable to you. Your use of the information contained in this work, however, is at your sole risk. All information in this work is provided "as -is", without any warranty, whether express or implied, of its accuracy, completeness, fitness for a particular purpose, title or non-infringement, and none of the third-party products or information mentioned in the work are authored, recommended, supported or guaranteed by Microsoft Corporation. Microsoft Corporation shall not be liable for any damages you may sustain by using this information, whether direct, indirect, special, incidental or consequential, even if it has been advised of the possibility of such damages. All prices for products mentioned in this document are subject to change without notice. International rights = English only.
International rights = English only.