.jpg)
SQL Server
Technical Article
Writer: W.H. Inmon
Published: October 2009
Applies to: SQL Server 2008 R2
Summary: Architecture and data warehousing are
not static. From the first notion of a data warehouse to a full-blown
analytical processing architecture that includes data marts, ETL, near line
storage, exploration warehouses, and other constructs, data warehousing and its
associated architecture continue to evolve. In 2008, the book on the latest
evolution of data warehousing appeared – DW
2.0: The Architecture for the Next Generation of Data Warehousing (Morgan
Kaufman). In that book the general architecture for data warehousing in its
highest evolved form appeared.
Introduction
Architecture
and data warehousing are not static. From the first notion of a data warehouse
to a full-blown analytical processing architecture that includes data marts,
ETL, near line storage, exploration warehouses, and other constructs, data
warehousing and its associated architecture continue to evolve. In 2008, the
book on the latest evolution of data warehousing appeared – DW 2.0: The Architecture for the Next
Generation of Data Warehousing (Morgan Kaufman). In that book the general
architecture for data warehousing in its highest evolved form appeared.
.jpg)
Figure 1: Diagram of the granular data found in
DW 2.0
Among other
things, DW 2.0 recognizes the life cycle of data within the data warehouse,
recognizes the need for including textual data in the data warehouse, and
recognizes that metadata is an essential component of the data warehouse
environment. Along the way, DW 2.0 recognizes that data warehouses attract
large amounts of data, store that data over a lengthy period of time, support a
wide variety of processing, and finally – data warehouses can become very costly
if you choose to make design and infrastructure decisions that are expensive.
SQL Server in Evolution
While
architecture has been evolving, so Microsoft® SQL Server® has also been
evolving. From the humble origins as a database that served small amounts of
data on a personal computer with very basic functions, SQL Server now is
prepared to serve as a database foundation for mid-size and very large amounts
of data for data warehousing.
It is said
that to increase the capacity or performance of a system that tuning the system
can get up to a 10% improvement. But to get an order or two magnitude of
performance and capacity improvement, a change in architecture of the system is
required. And indeed that is what SQL Server has undergone – a fundamental change
in architecture from the early days of SQL Server.
Just as data
warehouse and architecture has evolved, so SQL Server has also been evolving.
And whether by chance or by design, SQL Server has turned into the preferred
technology platform for the most advanced form of data warehouse architecture –
DW 2.0.
This means
that SQL Server has advanced mightily up the evolutionary curve for serving the
data warehouse community as the database foundation for large and sophisticated
data warehouses. No longer is SQL Server limited to small amounts of data and
personal computers. With the architectural enhancements of SQL Server it is
ready to become the infrastructure of choice when implementing advanced data
warehouse and analytical architectures such as DW 2.0.
Aspects of DW 2.0
There are
many aspects to the DW 2.0 architecture. Not all of them can be addressed in
the space of this white paper. However, some of the more profound and more
important aspects of DW 2.0 will be discussed in the context of SQL Server.
Basic Access of Data
For years the
preferred storage medium for data has been disk storage. Disk storage appeared
at the time that online transaction processing was first being done. In fact,
in many ways it was the advent of disk storage that allowed online transaction
processing to become a reality. The way that online transaction processing
enables access of data for transaction processing is to access online storage
randomly. To this end data is loaded onto disk storage either by hashing the
data as it is placed on storage or by the creation of an index (or both). In
accessing data for online transactions, a random and rapid access of small
amounts of data is required.
For many
online applications and for many usages of data, a pattern of rapid, random access
of small amounts of data on disk storage works just fine. But when it comes to
DSS, analytic processing, the basic pattern of access of data is quite
different. Most DSS, analytical processing is done by means of SQL. SQL
operates on sets of data, not records of data. Therefore, for analytical
processing, an access mode where the first record being sought is accessed
randomly and then the remaining records in the set are accessed sequentially
fit optimally with data warehouse analytical processing. Furthermore the sets
of data accessed by DSS processing may not be small at all. On many occasions
very large sets of data are accessed. In other words, a sequential access mode
for most of the data to be accessed in a data warehouse is optimal, not a random
mode of access for every record of data to be sought. Figure 2 shows this
difference in the fundamental mode of access in the OLTP environment and the
data warehouse analytic environment.
.jpg)
Figure 2: Illustrating the difference between
random and sequential I/O
In the latest
release of SQL Server the first random record then a sequential mode of access
is the one that is supported. This means that at the most basic level of
processing, SQL Server holds a major performance advantage over their
competition.
A Data Mart Migration Path
Another
recurring problem with data warehouses and analytic processing is the fact that
many organizations prefer to build data marts first, before they build an
actual data warehouse. Then one day the organization wakes up and discovers
that in addition to their data marts, they need a data warehouse. It is at this
point that there is no easy or graceful migration plan to go from multiple data
marts to a data-warehouse-centric environment. Many organizations start out in
the hopes that a data mart or two are going to meet their analytical needs. But
over time the problems with a data mart centric architecture start to appear –
there is no definitive source of corporate data, there is the need to build
every data mart from scratch when a new need for data appears, data marts are
terribly brittle and need to be destroyed and rewritten when basic business
requirements change, and so forth.
In DW 2.0,
the granular data found in DW 2.0 forms what is called the “system of record”
and becomes the “single version of the truth” for the organization. And from
the system of record, data marts are created just as they were created in
classical first-generation data warehouses.
Now SQL
Server supports the easy migration from data marts to data warehouses. SQL
Server offers the ability to build small data warehouses or data marts in its
FastTrack option. Then, when the volume of data grows and the need to create a
full scale data warehouse arises, SQL Server offers their SQL Server 2008 R2
Parallel Data Warehouse (previously codenamed Project “Madison”) option.
.jpg)
Figure 3: Parallel Data Warehouse offers an easy
migration path from data marts and smaller data warehouses to enterprise-scale
data warehouses
In the SQL
Server 2008 R2 Parallel Data Warehouse edition of SQL Server there is the
opportunity to manage as much data as is possible – up to petabytes of data.
There is the opportunity to support this volume of data in a parallel fashion.
There is the opportunity for redundancy of components so that the system can
process in an efficient and in a failsafe manner.
But perhaps
most importantly there is the opportunity to synchronize – automatically – data
residing in the FastTrack data mart or mini data warehouse with the data
managed centrally by SQL Server 2008 R2 Parallel Data Warehouse. While there
are other aspects to the conversion of a data mart to a data warehouse, SQL
Server has solved some of the hardest aspects of the problem very nicely. If an
organization commits to SQL Server as a basis for data warehouse processing,
many of the problems of migration from a data-mart-centric environment to a
real data warehouse are mitigated.
Data Warehouse Costs
Another
sensitivity of DW 2.0 is the recognition that the cost of the data warehouse is
an issue. And if it is not an issue today, it will be an issue tomorrow. As
data warehouse volumes of data grow, so grows the cost associated with data
warehousing. And the costs of the data warehouse grow along with the rise in
the volume of data.
When the
discussion of cost arises, it must be noted that the more centralized the
components of technology become, the more expensive they become. For example,
suppose that an organization needs a total of n units of processing power. The
most expensive thing an organization can do is to buy one central processor
that provides n units of power. The most cost-effective thing an organization
can do is to break up the n units of power into many different units. n units
of power are needed. Suppose that the cost of n units in a single processor is
X. Now suppose that the n units are divided into ten units – n/10. Suppose that
each n/10 unit costs Y. Then:
10 x Y < X.
In fact, 10 x
Y is FAR less than X. The most expensive processing cycles are those that are
found in the largest machines. The more the workload can be divided, the less
expensive the processing cycles become. Using the above equation, it is not
unreasonable that Y would be 1/100th of X. Using this equation, 10 x Y = 1/10X.
Therefore, from an economic standpoint, it makes sense to take work that needs
to be done and distribute that work over many different processors. This has
the effect of greatly reducing the costs of the data warehouse environment.
DW 2.0 is
aware of this general cost equation. In fact DW 2.0 starts with this basic
hypothesis as the basis for all following architectural decisions. And SQL
Server is also aware of this fundamental fact of life regarding technology
costs.
SQL Server
accommodates the need for the distribution of processing across the data
warehousing environment. Figure 4 shows this basic understanding of the costs
of technology.
.jpg)
Figure 4: The unique hub-and-spoke architecture
for SQL Server’s Parallel Data Warehouse
SQL Server
distributes the processing workload in several ways. The first way that SQL
Server supports the distribution of work across multiple locations is in
support of the hub-and-spoke architecture. Basic data management is done in the
data warehouse hub, where massive amounts of data can be handled. And end-user
analytical processing is handled in the different spokes of the architecture.
In fact, within the hub processor there is a distribution of the workload. The
hub processes data differently in different places, thus avoiding a large queue
that can and will negatively impact performance.
In doing so,
the costs of the infrastructure for SQL Server are held to a minimum, thus
allowing the organization to easily and cost effectively grow their data
warehouse and to achieve good and consistent performance at the same time.
Compression
Another
aspect of data warehouses – recognized by both DW 2.0 and SQL Server - is the
need to store and manage a large volume of data. There are many ways that large
volumes of data can be managed. A simple way to manage volumes of data is
through compression. In compression, extraneous data is removed or stored in a
minimized fashion. The techniques for compression are especially applicable for
a data warehouse because data warehouses – built properly – do not allow data
to be updated. Compression actually harms performance when update of data is
allowed because it is costly for the system to go and find data, decompress it,
update it, recompress the data and then try and replace the data in the database
efficiently. But because updating of data does not occur in the data warehouse
environment, compression of data makes a lot of sense. And indeed SQL Server
allows data to be compressed.
Parallel Processing
But the
biggest gain in the management of volumes of data that is now part of SQL
Server (SQL Server 2008 R2 Parallel Data Warehouse) is that of the parallel
management of volumes of data. In parallel processing of data, data is stored on
more than one device so that more than one processor can access and manage data
at the same time. In order to understand the value of parallel management of
data, consider what a buggy driver must do when the weight of the buggy becomes
too heavy for the horse. One alternative is to go from a regular sized horse to
an oversized horse such as a Percheron or a Clydesdale. Percherons were bred
years ago to allow knights in armor to ride them into battle or jousting
contests. And a knight in armor weighs a lot. This strategy works well as long
as there is a Percheron that is available and is less than ten years old. But
what happens if there is no Percheron available? Or what happens if the load is
too heavy for a Clydesdale to pull? Then, a team of horses – not a single horse
- is needed. And up to a point more horses can be added as the load to be
pulled grows.
The same
analogy applies to managing a lot of data. If a single server is overwhelmed by
its load of data, then multiple servers can be used at the same time and the
load of data can be divided over more than one server. Such an approach is
called a parallel approach because different sets of data are operated on in
parallel independently. In doing so adding more parallel servers increases the
total throughput that a system can handle.
And the SQL
Server 2008 R2 Parallel Data Warehouse option handles data in a parallel
fashion.
.jpg)
Figure 5: Parallel management of data in the
SQL Server hub
Probability of Access of Data
But
compression and parallel approaches to the management of data are not the only
way that large volumes of data can be managed. DW 2.0 calls for the physical
separation of data based on the probability of access of the data. Very highly
accessed data needs to be placed in high performance storage. In this regard, a
data warehouse built under SQL Server is like any other database management
system. But as the volume of data grows and the probability of access of data
drops, it no longer makes sense to store all data on high-performance storage.
Not only is data that is not accessed very expensive to place on
high-performance storage, the unused data gets in the way of accessing data
whose probability of access is indeed high. By placing all data on high-performance
storage, the organization has the worst of all worlds – large expense and poor
performance.
In order to
understand why data with a low probability of access should be removed from
high-performance storage, consider that an information system is in many ways
like the blood pumping through the human body. In a young athlete running a
marathon, there is very little cholesterol. The heart pumps blood efficiently
through the blood vessels of the athlete. But now consider a lethargic couch
potato. The couch potato has a lot of cholesterol in his/her body. The heart
has to work hard to pump blood through the cholesterol-clogged arteries of the
couch potato.
Dormant,
unused data in a data warehouse is like cholesterol in the body of an athlete.
The less cholesterol there is, the more efficiently the heart pumps. The less
unused data there is in high-performance storage, the more efficient it is to
find data that is being looked for in high-performance storage.
DW 2.0
recognizes this basic fact of life and SQL Server also recognizes this fact.
.jpg)
Figure 6: Logical separation of data based on
probability of access
SQL Server
allows data to be divided according to its probability of access. In SQL Server
data can be hot, warm, or cold. By physically dividing data into different
sectors, performance of data is greatly enhanced.
Streaming Data
But there is
another very important feature of SQL Server that sets it apart from any other database
management system. That capability is the ability to handle streaming data.
DW 2.0 calls
for the separation of online high performance from integrated data processing.
DW 2.0 recognizes what is termed the “interactive sector”. It is in the
interactive layer that data can be entered in a continuous, high-performance
fashion. Figure 7 shows that there are two basic divisions of data – static
data and streamed data.
.jpg)
Figure 7: Static and streamed data
Most database
management systems manage static data. Static data is data that is recorded as
a byproduct of some event occurring on an event-by-event basis. The event that
occurs usually happens in a relaxed manner. At 10:01 am there is a bank
deposit. At 11:03 am there is an ATM activity. Throughout the day, static data
enters the system in a random, somewhat relaxed manner. Streamed data differs
from static data in that streamed data occurs and enters the database system
very rapidly and very predictably.
One way to
think of streamed data is that a single event occurs, then multiple fast
measurements are made because of the occurrence of the event.
As an example
of streamed data, consider an electronic eye that measures the output of steel
in a steel manufacturing environment. The event that occurs is the production
of a batch of steel. The output steel from the batch is rolled out in a bar
very rapidly as the smelting process is finished. An electronic eye takes a
measurement of the steel every 12 inches, or about 1,000 times a minute. As
long as the mill is producing steel, the measurements are made quickly and
consistently.
The
electronic eye measures lots of variables as the steel passes by, such as:
- temperature
of the steel
- chemical
composition of the steel
- width
of the steel bar, and so forth
In such a
manner, very large volumes of data are captured in a rapid and predictable
manner triggered by a single event. This mode of capturing data into a database
is called the streaming mode. Streaming mode data fits very nicely with the
interactive layer of data within DW 2.0. Static data does not fit well or
comfortably within the interactive layer of data within DW 2.0.
Historical Data - What Does That Mean?
One of the database
variables that must be addressed in database technology is the meaning of
historical data or archival data. When most people think of archival data, they
often think of large archives of data that are 5 to 10 years old. Indeed that
kind of data is archival data. But when one considers the meaning of historical
data, one is faced with a dilemma. Data that is one second old is historical
data. But is data that is one second old also archival data? The answer is that
if all data that is historical data is also archival data, then indeed data
that is one second old is archival data.
But in most
environments calling data that is one second old archival data simply is very
misleading. So let us call data that is still very freshly newly created
historical data and let us call data that is older than that true archival
data.
Figure 8
shows that historical data can be divided up into two classes – newly created
historical data and true archival data.
.jpg)
Figure 8: Two classes of historical data
This
distinction of what is meant by historical data is necessary for understanding
what kind of data needs to be placed in the interactive sector. In DW 2.0 there
is a sector of data called the interactive sector. The interactive sector
contains newly created historical data such a streamed data, not archival data.
This
distinction is important the face of understanding the true nature of DW 2.0’s
interactive sector data and processing. One vendor has created what is called
the “active data warehouse”. This vendor claims that up-to-the second
processing and analysis can be done with the active data warehouse in the data
warehouse. Unfortunately the data that is being managed under the active data
warehouse is static data, not streamed data. And therein lies the problem. When
static data in the interactive sector is treated as streamed data, much
confusion and much inefficiency occurs. Static data requires an infrastructure
that demands integrity – integrity of transaction processing and integrity of database
processing. By trying to treat static data as if it were streamed data in the
interactive sector, there is a very high and unnecessary cost of the
infrastructure. In addition the vendor that supports the active data warehouse
tries to query its static/streamed data using the query tools designed for
static data. Streamed data requires an entirely different means of query and
analysis. The net result of active data warehousing is confusion, waste, and
the inability to handle streamed data as it needs to be handled. Active data
warehousing is not a form of data warehousing at all. Only when the data that
is streamed is managed by techniques fit for streaming is the interactive
sector actually created.
An active
data warehouse with all of its very expensive costs and technological
limitations is no substitute for handling streamed data. It is streamed data
that belongs in the interactive sector of DW 2.0, not static data.
Unlike the
vendor that promulgates active data warehousing, SQL Server handles streamed
data. Figure 9 shows how SQL Server handles streamed data.
.jpg)
Figure 9: How SQL Server handles streamed data
Figure 9
shows that streamed data is handled separately from static data. Figure 9 shows
that streamed data is captured in the form of newly created historical data.
The data found in the steaming component/the DW 2.0 interactive sector is
accessed and manipulated by a language suited for accessing and manipulating
streams of data, not static data. The output of processing can be a report, an
answer, or the movement of streamed data into a static environment. Microsoft
will release its new streaming engine, StreamInsight, in the same timeframe as
SQL Server 2008 R2.
The Fit Between DW 2.0 and SQL Server
These then
are merely the highlights of how there is a very good architectural fit between
the architecture of the future of data warehouse – DW 2.0 – and SQL Server. DW
2.0 and SQL Server recognize –
- the
need to handle very large volumes of data
- the
need to be constantly aware the costs of data warehousing
- the
need to separate interactive streamed data and processing from other parts of
data and processing
- the
need to manage data in a parallel manner
- the
need to divide the workload onto as many smaller components as possible
- the
need for basic sequential access of sets of data
- the
need to have a rational migration path from data marts and mini data warehouses
to a large centralized data warehouse
- the
place and position of streamed data
- the
need to physically separate data based on the differences in the probability of
access of the data.
References
DW 2.0 – Architecture for the Next Generation of Data
Warehousing, Morgan Kaufman, 2008
For more information:
http://www.microsoft.com/madison:SQL Server 2008 R2 Parallel Data
Warehouse
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this
paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to
5 (excellent), how would you rate this paper and why have you given it this
rating? For example:
Are you
rating it high due to having good examples, excellent screen shots, clear
writing, or another reason?
Are you
rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback
will help us improve the quality of white papers we release.
Send feedback.