Using CLR Integration in SQL Server 2005

 

Balaji Rathakrishnan
Christian Kleinerman
Brad Richards
Ramachandran Venkatesh
Vineet Rao
Isaac Kunen

Published: May 2005

Summary: This paper describes the new CLR integration features of SQL Server 2005 and how database application developers and architects can take advantage of them to write user-defined procedures, functions, and triggers, as well as define new types and aggregates. (46 printed pages)

We compare CLR-based programming against existing programming models supported in SQL Server, such as Transact-SQL and extended stored procedures, highlight the relative strengths and weaknesses of each technique, and provide a set of high-level guidelines on how to choose between the available programming alternatives. We also provide a set of code examples illustrating CLR integration features.

Contents

Introduction
Overview of CLR Integration
    Manual Deployment
    Building, Deploying, and Debugging Using Visual Studio
CLR and Its Alternatives
    CLR vs. Transact-SQL
    CLR vs. XPs
    Code in the Middle Tier
    Example: Production Scheduling
Solving Common Database Programming Tasks and Problems
    Data Validation using the .NET Framework Library
    Producing Result Sets
    Performing Custom Aggregations over Data
    User Defined Types
Conclusion

Introduction

Microsoft SQL Server 2005 significantly enhances the database programming model by hosting the Microsoft .NET Framework 2.0 Common Language Runtime (CLR). This enables developers to write procedures, triggers, and functions in any of the CLR languages, including Microsoft Visual C# .NET, Microsoft Visual Basic .NET, and Microsoft Visual C++. It also allows developers to extend the database with new types and aggregates.

This article describes how best to take advantage of this technology from the perspective of a database application developer, comparing CLR integration techniques with existing programming language support in SQL Server: Transact-SQL (T-SQL) and extended stored procedures (XPs). This paper is not a reference for these features; reference information can be found in the SQL Server 2005 Books Online. The samples in this paper assume the April 2005 CTP of SQL Server 2005 or later.

The target audience of this article includes database application developers, architects, and administrators. The paper assumes a working familiarity with .NET Framework-based programming and database programming.

Overview of CLR Integration

What follows is a brief overview of the SQL Server functionality that is enabled by CLR integration and how Visual Studio 2005 supports these features.

Manual Deployment

Registering and executing managed code in the database consists of the following steps:

  1. The developer writes a managed program as a set of class definitions. SQL Server routines—stored procedures, functions, or triggers—are written as static (or Shared in Microsoft Visual Basic .NET) methods of a class. User-defined types and aggregates are written as entire classes. The developer compiles the code and creates an assembly.
  2. The assembly is uploaded into a SQL Server database, where it is stored in the system catalogs using the CREATE ASSEMBLY data definition language (DDL) statement.
  3. Transact-SQL (T-SQL) objects, such as routines, types, and aggregates, are then created and bound to entry points (methods in the case of routines, and classes for types and aggregates) in the previously uploaded assembly. This is accomplished using the CREATE PROCEDURE/FUNCTION/TRIGGER/TYPE/AGGREGATE statements (a sketch follows this list).
  4. After routines are created, they can be used like T-SQL routines by applications. For example, CLR functions can be called from T-SQL queries and CLR procedures can be called from a client application or from a T-SQL batch as if they were T-SQL procedures.
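
For illustration, steps 2 and 3 might look like the following. The assembly path, assembly name, and the class and method names are placeholders, not part of any shipped sample:

CREATE ASSEMBLY HelloWorld
FROM 'C:\assemblies\HelloWorld.dll'
WITH PERMISSION_SET = SAFE;

CREATE PROCEDURE dbo.SayHello
AS EXTERNAL NAME HelloWorld.Greetings.SayHello;

Here Greetings is the class containing the static SayHello method inside the HelloWorld assembly.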

Building, Deploying, and Debugging Using Visual Studio

Visual Studio 2005 supports development, deployment, and debugging of managed code in SQL Server 2005. A new SQL Server Project provides code templates that make it easy for developers to get started writing code for CLR-based database routines, types and aggregates. This project also allows developers to add references to other assemblies in the database.

When a SQL Server Project is built, it is compiled into an assembly. Deploying the project uploads the assembly binary into the SQL Server database that is associated with the project. The deploy operation also automatically creates the routines, types, and aggregates defined in the assembly in the database based on the custom attributes (SqlProcedure, SqlFunction, SqlTrigger, etc.) in the code. Deploying also uploads the source code and debugging symbols (the .pdb file) associated with the assembly.

As debugging is a fundamental part of the developer experience for any platform, SQL Server 2005 and Visual Studio 2005 provide database programmers with such capabilities. A key feature in debugging SQL Server 2005 objects is the ease of setup and use. Connections to the computer running SQL Server may be debugged in much the same way as processes running under a traditional operating system. Functionality of the debugger is not affected by the type of connection to the server that the client has: both Tabular Data Stream (TDS) and HTTP connections can be debugged. In addition, debugging works seamlessly across languages, allowing the user to step from T-SQL into CLR methods and vice versa.

CLR and Its Alternatives

In evaluating the use of CLR integration, a developer needs to compare it to the other available options. Here we aim to provide a basis for that comparison, positioning CLR integration against existing programming techniques: Transact-SQL, extended stored procedures, and code in the middle tier. In this section we concentrate on user-defined routines.

CLR vs. Transact-SQL

Transact-SQL (T-SQL) is the native programming language supported by SQL Server. Like most versions of SQL, it contains data manipulation features and data definition features. The data manipulation features can be broadly categorized into two parts: a declarative query language (composed of SELECT/INSERT/UPDATE/DELETE statements) and a procedural language (WHILE, assignment, triggers, cursors, etc.). Broadly speaking, CLR support in SQL Server provides an alternative to the procedural portion of T-SQL.

Even without CLR support, it is important to recognize that database applications should use the declarative query language as much as possible. This portion of the language is able to leverage the power of the query processor, which is best able to optimize and perform bulk operations. Database applications should only resort to procedural programming to express logic that cannot be expressed within the query language.

All of this remains true with CLR support in SQL Server: the CLR should not be used to write procedural code that can be expressed using the declarative features of T-SQL. Developers should be aware that SQL Server 2005 adds a number of significant enhancements to the T-SQL query language that augment its expressive power, and should ensure that they are taking full advantage of them before writing procedural code, whether in the CLR or not. Some of these added features include:

  • The ability to write recursive queries to traverse hierarchies in a table
  • New analytical functions such as RANK and ROW_NUMBER that allow ranking rows in a result set
  • New relational operators such as EXCEPT, INTERSECT, APPLY, PIVOT and UNPIVOT
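
As a minimal illustration of one of these features, the following query uses ROW_NUMBER to rank forecast rows within each product; it assumes the t_salesForecast table introduced later in this paper:

SELECT pid, demandDate, demandQty,
    ROW_NUMBER() OVER (PARTITION BY pid ORDER BY demandQty DESC) AS rowNo
FROM t_salesForecast;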

Developers should view the CLR as an efficient alternative for logic that cannot be expressed declaratively in the query language.

Let us look at some scenarios where CLR-based programming can complement the expressive power of the T-SQL query language. Often there is a need to embed, inside a query, procedural logic that can be called as a function. This includes situations such as:

  • Performing complex calculations (that have to be expressed using procedural logic) on a per-row basis over values stored in database tables. This can involve sending the results of these calculations to the client, or using the calculations to filter the set of rows that are sent to the client, as in the following example:

    SELECT <complex-calculation>(<column-name>,...)
    FROM <table>
    WHERE <complex-calculation>(<column-name>,...) = ... 
    
  • Using procedural logic to compute tabular results that are then queried in the FROM clause of a SELECT or DML statement.

SQL Server 2000 introduced T-SQL functions (both scalar and table-valued) that enable these scenarios. With SQL Server 2005, these functions can be more easily written using the CLR languages, since developers can take advantage of the much more extensive libraries in the .NET Framework API. In addition, CLR programming languages provide rich data structures (such as arrays, lists, etc.) that are lacking in T-SQL, and can perform significantly better due to the different execution models of the CLR and T-SQL.

Functions are, in general, good candidates to be written using the CLR, since there is seldom a need to access the database from within a function: values from the database are usually passed as arguments. This then plays to the strength of the CLR, which is better at computational tasks than T-SQL.

Data Access from the CLR

We now look at the CLR as an option for writing routines that perform data access, both from the perspective of the programming model and performance.

In T-SQL, query language statements such as SELECT, INSERT, UPDATE, and DELETE are simply embedded within procedural code. Managed code, on the other hand, uses the ADO.NET data access provider for SQL Server (SqlClient). In this approach, all query language statements are represented by dynamic strings that are passed as arguments to methods and properties in the ADO.NET API.

Because of this difference, data access code written using the CLR can be more verbose than T-SQL. More importantly, because the SQL statements are encoded in dynamic strings they are not compiled or validated until they are executed, impacting both the debugging of the code and its performance. However, the database programming model with ADO.NET is very similar to that used in the client or middle-tiers, which makes it easier both to move the code between the tiers and to leverage existing skills.

Again, it is important to note that both T-SQL- and CLR-based programming models use the same SQL query language; only the procedural portions differ.

As already mentioned, managed code has a decisive performance advantage over T-SQL with respect to most procedural computation, but for data-access T-SQL generally fares better. Therefore, a good general rule is that computation- and logic-intensive code is a better choice for implementation in the CLR than is data-access intensive code.

Let us now look at some typical primitives and patterns in data-access programming, comparing how T-SQL and managed programming using the integrated CLR and ADO.NET perform in these scenarios.

Sending Results to the Client

Our first scenario involves sending a set of rows to the client without "consuming" them in the server, i.e., no navigation of the rows is done inside the routine. With T-SQL, simply embedding a SELECT statement in the T-SQL procedure has the effect of sending rows produced by the SELECT to the client. With managed code, the SqlPipe object is used to send results to the client. T-SQL and ADO.NET perform approximately equally in this scenario.

Submitting SQL Statements

Submitting SQL statements from the CLR involves traversing additional layers of code to switch between managed and native SQL code. Because of this, T-SQL has a performance advantage when issuing a SQL query. Note that after the statement is submitted to the query processor, there is no difference in performance based on the source of the statement (whether in T-SQL or managed code). If the query is complex and takes a long time to evaluate then the difference in performance between T-SQL and managed code will be negligible. For short, simple queries, the overhead of the additional code layers can impact the performance of a managed procedure.

Typical data-access intensive stored procedures are likely to involve submitting a sequence of SQL statements. If the SQL statements are simple and do not take a significant amount of time to execute, then the calling overhead from managed code can dominate the execution time; such procedures will perform better written in T-SQL.

Forward-Only, Read-Only Row Navigation

In T-SQL, forward-only, read-only navigation is implemented using a cursor. In CLR code it is implemented with a SqlDataReader. Typically, there is some processing done for each piece of data. Ignoring this, T-SQL has an advantage, since the navigation of rows using the CLR is slightly slower than in T-SQL. However, since the CLR will significantly outperform T-SQL on any processing done on the data, the CLR performance will overtake that of T-SQL as the amount of processing increases.
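
For reference, a minimal T-SQL sketch of this pattern follows; it reads from the t_salesForecast table used later in this paper, and the per-row work is elided:

DECLARE @pid int, @qty int;
DECLARE demand_cur CURSOR FAST_FORWARD FOR
    SELECT pid, demandQty FROM t_salesForecast;
OPEN demand_cur;
FETCH NEXT FROM demand_cur INTO @pid, @qty;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-row processing goes here
    FETCH NEXT FROM demand_cur INTO @pid, @qty;
END;
CLOSE demand_cur;
DEALLOCATE demand_cur;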

Additionally, one should be conscious of the potential for additional latency when using T-SQL cursors. Although some queries will necessarily exhibit some latency because they must materialize intermediate results, STATIC or KEYSET cursors will always materialize the final result set in a temporary table before producing any results. A cursor can be either explicitly declared as STATIC or KEYSET, or implicitly converted to one due to certain properties of the query and data. The CLR SqlDataReader will always produce results as they become available, avoiding this latency.

Row-Navigation with Updates

If the problem involves updating rows based on the current position of the cursor, then there is no relevant performance comparison: this functionality is not supported through ADO.NET, and must be done through T-SQL updateable cursors. Keep in mind, however, that it is generally better to use UPDATE statements to update rows in bulk, saving cursor-based modifications for when they cannot be expressed with declarative SQL.
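
A minimal sketch of a T-SQL updateable cursor follows. The filter here is deliberately trivial (it could in fact be expressed as a declarative UPDATE), so treat it purely as an illustration of the WHERE CURRENT OF syntax:

DECLARE @qty int;
DECLARE upd_cur CURSOR FOR
    SELECT demandQty FROM t_salesForecast
    FOR UPDATE OF demandQty;
OPEN upd_cur;
FETCH NEXT FROM upd_cur INTO @qty;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @qty < 0
        UPDATE t_salesForecast SET demandQty = 0
        WHERE CURRENT OF upd_cur;
    FETCH NEXT FROM upd_cur INTO @qty;
END;
CLOSE upd_cur;
DEALLOCATE upd_cur;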

Summary

Here is a summary of the guidelines discussed above for choosing between the CLR and T-SQL:

  • Use declarative T-SQL SELECT, INSERT, UPDATE, and DELETE statements whenever possible. Procedural and row-based processing should be used only when the logic is not expressible using the declarative language.
  • If the procedure is simply a wrapper for declarative T-SQL commands it should be written in T-SQL.
  • If the procedure primarily involves forward-only, read-only row navigation through a result set with some processing of each row, using the CLR is likely more efficient.
  • If the procedure involves both significant data access and computation, consider separating the procedural code into a CLR portion that calls into a T-SQL procedure to perform data access, or a T-SQL procedure that calls into the CLR to perform computation. Another alternative is to use a single T-SQL batch that includes a set of queries executed once from managed code, reducing the number of round trips of submitting T-SQL statements from managed code (see the sketch following this list).
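
The following C# fragment sketches the batching alternative from the last bullet. The table and statements are hypothetical; the point is simply that one call incurs a single managed-to-SQL transition for the whole batch:

using (SqlConnection conn =
    new SqlConnection("context connection=true"))
{
    conn.Open();
    SqlCommand cmd = new SqlCommand(
        "UPDATE t_orders SET status = N'queued' WHERE status = N'new';" +
        "DELETE FROM t_orders WHERE status = N'cancelled';" +
        "SELECT COUNT(*) FROM t_orders;", conn);
    // The UPDATE and DELETE produce no result sets, so ExecuteScalar
    // returns the first column of the first row of the SELECT.
    object remaining = cmd.ExecuteScalar();
}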

Later sections discuss in more depth when and how to appropriately use T-SQL and the CLR when working with result sets.

CLR vs. XPs

In previous releases of SQL Server, extended stored procedures (XPs) were the only alternative to T-SQL for writing server-side logic that was difficult to express in T-SQL. CLR integration provides a more robust alternative to XPs. In addition, with CLR integration, many stored procedures can be better expressed as table-valued functions, allowing them to be invoked and manipulated using the query language.

Some of the benefits of using CLR procedures over XPs are:

  • Granular control: SQL Server administrators have little control over what XPs can or cannot do. Using the Code Access Security model, a SQL Server administrator can assign one of three permission buckets—SAFE, EXTERNAL_ACCESS, or UNSAFE—to exert varying degrees of control over the operations that managed code is allowed to perform.
  • Reliability: Managed code, especially in the SAFE and EXTERNAL_ACCESS permission sets, provides a more reliable programming model than XPs do. Verifiable managed code ensures that all access to objects is performed through strongly typed interfaces, reducing the likelihood that the program accesses or corrupts memory buffers belonging to SQL Server.
  • Data access: With XPs, an explicit connection back to the database—a loop-back connection—must be made in order to access the local SQL Server database. In addition, this loop-back connection must be explicitly bound to the transaction context of the original session to ensure that the XP participates in the transaction in which it is invoked. Managed CLR code can access local data using a more natural and efficient programming model that takes advantage of the current connection and transaction context.
  • Additional Data Types: The managed APIs support new data types (such as XML, (n)varchar(max), and varbinary(max)) introduced in SQL Server 2005, while the ODS APIs have not been extended to support these new types.
  • Scalability: The managed APIs that expose resources such as memory, threads, and synchronization are implemented on top of the SQL Server resource manager, allowing SQL Server to manage these resources for CLR code. Conversely, SQL Server has no view or control over the resource usage of an XP. If an XP consumes too much CPU time or memory, there is no way to detect or control this from within SQL Server. With CLR code, SQL Server can detect that a given thread has not yielded for a long period of time and force the task to yield so that other work can be scheduled. Consequently, using managed code provides for better scalability and robustness.

As mentioned above, for data access and sending result sets to the client, CLR routines outperform XPs. For code that does not involve data access or sending results, comparing the performance of XPs and managed code is a matter of comparing managed code with native code. In general, managed code cannot beat the performance of native code in these scenarios. Furthermore, there is an additional cost during transitions from managed to native code when running inside SQL Server because SQL Server needs to do additional book-keeping on thread-specific settings when moving out to native code and back. Consequently, XPs can significantly outperform managed code running inside SQL Server for cases where there are frequent transitions between managed and native code.

For most procedures, the benefits of managed code make CLR procedures a more attractive alternative than XPs. For cases where the performance is primarily determined by computationally-intensive processing and frequent managed-native transitions, the benefits of the CLR should be weighed against the raw performance advantage of XPs.

Code in the Middle Tier

Another choice for developers is to place their logic outside the database, which allows them to write that code in the language of their choice. By providing a rich programming model in the database, CLR integration offers developers the choice of moving such logic into the database. Of course, this does not mean that all—or even most—code should be moved to the database.

Moving logic to the database tier can reduce the amount of data flowing on the network, but puts an additional load on the valuable CPU resources of the server. This tradeoff should be considered carefully before making code placement decisions for an application. The following considerations can favor the database tier as the preferred code location:

  • Data validation: Keeping the data validation logic in the database offers better encapsulation of this logic with the data, avoiding duplication of validation logic across different data touch points, such as back-end processing, bulk upload, data updates from the middle tier, etc.
  • Network traffic reduction: Placing logic in the database may be appropriate for data processing tasks that involve processing a large amount of data while producing a very small percentage of that data. Typical examples include data analysis applications such as demand forecasting, scheduling production based on forecast demands, etc.

Of course, these considerations are important even without CLR integration; CLR integration simply helps ensure that the choice of programming language does not interfere with the right code location decision.

Example: Production Scheduling

Production scheduling is a common task in manufacturing companies. At a high level, it involves creating a plan for when to produce items in order to satisfy demand, while minimizing the total cost of producing and storing the items. Several algorithms exist to take demand forecasts, inventory storage costs, and production line setup costs as input, and produce a manufacturing strategy as output.

Assuming future demand forecasts are stored in a table in SQL Server, an implementation of such an algorithm has the following characteristics:

  1. It takes a large amount of data (demand forecasts) as input.
  2. It produces a small result (such as number of units to produce at a given date or dates).
  3. It requires considerable computation to derive the output from the inputs.

Implementing such an algorithm in the middle tier is an option, but there will be a large cost in shipping demand data out of the database. Implementing it in T-SQL as a stored procedure is also feasible, but the lack of complex data types will make implementation difficult, and the performance of T-SQL will be an issue because of the quantity and complexity of the computations required. Of course, performance characteristics will vary depending on the actual volume of data and complexity of the algorithm.

To verify the suitability of CLR integration to such a scenario, we took a specific production scheduling algorithm—a dynamic programming implementation of the Wagner-Whitin algorithm—and implemented it using both the CLR and T-SQL. As expected, CLR integration greatly outperformed T-SQL. The implementation in C# was also more straightforward, since the algorithm uses single- and multi-dimensional arrays not available in T-SQL. Overall, the CLR version performed several orders of magnitude better than a T-SQL implementation.

Let us assume the following simple database schema that keeps track of the list of products that can be produced.

Table t_products:

Column name     Type            Description
pid             int             Primary key; ID of a product
pName           nvarchar(256)   Name of the product
inventoryCost   int             Cost of storing this product per period
startupCost     int             Cost of setting up a manufacturing line to
                                begin fabrication of this product

In addition, the following table stores the demand forecast information per product per day. We assume that the pid column is a foreign key into the table t_products, and that there is a uniqueness constraint on the (pid, demandDate) pair.

Table t_salesForecast:

Column name   Type            Description
pid           int             Product ID
demandDate    smalldatetime   Day for which demand is forecasted
demandQty     int             Demand forecast for the given product

We created a stored procedure that takes, in addition to this product data, parameters expressing the range of dates for which it should create a production schedule.

This stored procedure returns a rowset with the schema in the following table.

Column name   Type            Description
product       nvarchar(256)   Name of the product
period        datetime        Production day
quantity      int             Number of items to be manufactured
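
Once the assembly containing the implementation has been uploaded with CREATE ASSEMBLY, the procedure might be registered as follows; the assembly name Scheduler is a placeholder:

CREATE PROCEDURE dbo.Schedule
    @start datetime,
    @end   datetime
AS EXTERNAL NAME Scheduler.ProductionScheduler.Schedule;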

The C# version of the code is listed below to illustrate the kind of scenario that can significantly benefit from CLR integration:

using System; 
using System.Data;
using System.Collections;
using System.Data.SqlTypes;
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public partial class ProductionScheduler
{
    const int MAXNAME = 256;                   // The maximum name size

    [Microsoft.SqlServer.Server.SqlProcedure]  // Flag as a SQL procedure 
    public static void Schedule(SqlDateTime start, SqlDateTime end)
    {
        // Guarantee that we have a valid connection while we run
        using (SqlConnection conn = 
            new SqlConnection("context connection=true"))
        {
            conn.Open();                       // open the connection
            SqlPipe pipe = SqlContext.Pipe;    // get the pipe

            // Find all the products in the database with any demand
            // whatsoever along with data about their production costs.
            // Make sure they are ordered.
            ArrayList items = new ArrayList();
            SqlCommand cmd = new SqlCommand(
                " SELECT DISTINCT tp.pid, pname, startupCost,"
                "     inventoryCost" +
                " FROM t_products tp" +
                " JOIN t_salesForecast ts" +
                " ON tp.pid = ts.pid" + 
                " ORDER BY pid",
                conn);
            SqlDataReader reader = cmd.ExecuteReader();
            while (reader.Read())
            {
                items.Add(new Item(
                    reader.GetInt32(0), reader.GetSqlChars(1),
                    reader.GetInt32(2), reader.GetInt32(3)));
            }
            reader.Close();

            // Now get all the production schedule information, ordered
            // by PID and demand date.
            cmd = new SqlCommand(
                "SELECT pid, demandDate, demandQty" + 
                " FROM t_salesForecast" + 
                "  WHERE demandDate >= @start" +
                "  AND demandDate <= @end" +
                " ORDER BY pid, demandDate",
                conn);
            cmd.Parameters.AddWithValue("@start", start);
            cmd.Parameters.AddWithValue("@end", end);
            reader = cmd.ExecuteReader();

            // Read each section of schedule information into the items.
            reader.Read();
            for (int i = 0; (i < items.Count) && (!reader.IsClosed); i++)
            {
                ((Item)(items[i])).readData(reader);
            }

            // ensure the reader is closed
            if (!reader.IsClosed) reader.Close();


            foreach (Item item in items)
            {
                // Compute the schedule and report it
                item.ComputeSchedule();
                item.OutputSchedule(pipe);
            }
        }
    }

    class Item
    {
        // Information about the product we are scheduling. These will
        // be pulled from the database.
        private int pid;
        private SqlChars name;
        private int startCost;
        private int holdCost;

        // Store how many dates we have.
        private int size = 0;

        // The dates on which we have demand. These are guaranteed to
        // be unique by the database, and we will load them in order.
        private ArrayList dates = new ArrayList();
        // Store what the demand was on each date.
        private ArrayList quantities = new ArrayList();

        // Our schedule, which we have not yet computed.
        int[] schedule = null;

        // Set up the metadata for the return
        SqlMetaData[] metadata = new SqlMetaData[] {
            new SqlMetaData("product", SqlDbType.NVarChar, MAXNAME),
            new SqlMetaData("period", SqlDbType.DateTime),
            new SqlMetaData("quantity", SqlDbType.Int)
        };

        public Item(int pid, SqlChars name, int startCost, int holdCost)
        {
            this.pid = pid;
            this.name = name;
            this.startCost = startCost;
            this.holdCost = holdCost;
        }

        /*
         * Read data from the stream until the PID does not match
         * ours anymore. We assume the reader is cued up to our
         * information and we leave it cued to the next item's
         * information UNLESS there is no more information, in which
         * case we close the reader to indicate as much.
         */
        public void readData(SqlDataReader reader)
        {
            size = 0;
            do
            {
                if (reader.GetInt32(0) == pid)
                {
                    size++;
                    dates.Add(reader.GetDateTime(1));
                    quantities.Add(reader.GetInt32(2));
                }
                else
                {
                    return;
                }
            } 
            while (reader.Read());

            // reader ran out. close it.
            reader.Close();
        }

        /*
         * This method is called to compute the production schedule
         * for the item. It does no I/O, but puts in motion the
         * dynamic programming algorithm which produces the schedule.
         */
        public void ComputeSchedule()
        {
            int[] days = ComputeProductionDays();
            schedule = new int[size];
            for (int i = 0; i < size; i++)
            {
                schedule[days[i]] += (Int32)(quantities[i]);
            }
        }

        /* 
         * Pipe the schedule to the user, computing it if need be.
         */
        public void OutputSchedule(SqlPipe pipe)
        {
            // Ensure that the schedule has been computed.
            if (schedule == null)
            {
                ComputeSchedule();
            }

            // Make a record in which to store the data.
            SqlDataRecord record = new SqlDataRecord(metadata);
            record.SetSqlChars(0, name);

            // Start up the output pipe.
            pipe.SendResultsStart(record);
            for (int i = 0; i < size; i++)
            {
                // Pipe each day of production. Omit zero production
                // days.
                if (schedule[i] != 0)
                {
                    record.SetDateTime(1, (DateTime)(dates[i]));
                    record.SetInt32(2, schedule[i]);
                    pipe.SendResultsRow(record);
                }
            }
            pipe.SendResultsEnd();
        }

        /* 
         * Compute the table and then walk it to find the best
         * days to produce the item.
         */
        private int[] ComputeProductionDays()
        {
            // We fill this in. It says when each day's quota is
            // actually produced.
            int[] productionDays = new int[size];
            
            // First, compute the table.
            int[][] table = ComputeTable();

            // Then walk the table, creating a second table which encodes
            // the best production days.
            int[] optimal = new int[size + 1];
            int[] optimalLoc = new int[size];
            optimal[size] = 0;
            for (int i = size - 1; i >= 0; i--)
            {
                int min = table[i][i] + optimal[i + 1];
                int minloc = i;
                for (int j = i+1; j < size; j++)
                {
                    int temp = table[i][j] + optimal[j + 1];
                    if (temp < min)
                    {
                        min = temp;
                        minloc = j;
                    }
                }
                optimal[i] = min;
                optimalLoc[i] = minloc;
            }

            // Finally, decode the optimal values into production days.
            int pday = 0;
            int until = optimalLoc[0] + 1;
            for (int i = 0; i < size; i++)
            {
                if (until == i)
                {
                    pday = i;
                    until = optimalLoc[i] + 1;
                }
                productionDays[i] = pday;
            }

            // We now have a list of the days on which we will produce
            // the good.
            return productionDays;
        }

        /* 
         * The main part of the dynamic programming solution. Each entry
         * table[i,j] stores the cost of producing enough of the good on
         * day i to meet needs through day j. This table is only half-
         * filled when complete.
         */
        private int[][] ComputeTable()
        {
            int[][] table = new int[size][];
            for (int i = 0; i < size; i++) table[i] = new int[size];
            for (int i = 0; i < size; i++)
            {

                // If we produce the good on the same day we ship it we
                // incur a startup cost.
                table[i][i] = startCost;
                
                // For other days, we have the cost for the previous
                // cell plus the cost of storing the good for this long.
                for (int j = i + 1; j < size; j++)
                {
                    table[i][j] = table[i][j - 1] + 
                        (((int)quantities[j]) * holdCost *
                        diff((DateTime)(dates[i]), (DateTime)(dates[j])));
                }
            }
            return table;
        }

        /* 
         * A utility to compute the difference between two days. 
         */
        private int diff(DateTime start, DateTime end)
        {
            TimeSpan diff = end.Subtract(start);
            return diff.Days;
        }

    }
};

Solving Common Database Programming Tasks and Problems

The previous section positioned CLR-based programming at a high level in relation to T-SQL, extended stored procedures (XPs), and code in the middle tier. In this section, we look at a set of programming tasks and patterns that a database application developer might encounter, and discuss how (or how not) to use CLR integration to solve them. We provide several code examples, in both C# and Visual Basic .NET.

Data Validation using the .NET Framework Library

CLR integration in SQL Server 2005 allows users to leverage the rich functionality provided by the .NET Framework class libraries to solve their database programming problems.

An example of this is the use of regular expressions to provide more complete textual pattern-matching than what is available using the LIKE operator in T-SQL. Consider the following code, which is nothing more than a simple wrapper around the Regex class in the System.Text.RegularExpressions namespace.

In Visual Basic .NET:

Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Imports System.Text.RegularExpressions

Partial Public Class Validation

    <SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _
    Public Shared Function RegExMatch(ByVal pattern As String, _
        ByVal matchString As String) As Boolean

        Dim r1 As Regex = New Regex(pattern.TrimEnd(Nothing))
        Return r1.Match(matchString.TrimEnd(Nothing)).Success
    End Function

    <SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _
    Public Shared Function ExtractAreaCode(ByVal matchString As String) _
        As SqlString

        Dim r1 As Regex = New Regex("\((?<ac>[1-9][0-9][0-9])\)")
        Dim m As Match = r1.Match(matchString)

        If m.Success Then
            Return m.Value.Substring(1, 3)
        Else
            Return SqlString.Null
        End If

    End Function
End Class

In C#:

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

public partial class Validation
{
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static bool RegExMatch(string pattern, string matchString)
    {
        Regex r1 = new Regex(pattern.TrimEnd(null));
        return r1.Match(matchString.TrimEnd(null)).Success;
    }

    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlString ExtractAreaCode(string matchString)
    {

        Regex r1 = new Regex("\\((?<ac>[1-9][0-9][0-9])\\)");
        Match m = r1.Match(matchString);
        if (m.Success)
            return m.Value.Substring(1, 3);
        else return SqlString.Null;
    }
}

Let us assume that the RegExMatch() and ExtractAreaCode() methods have been registered as user-defined functions in the database with the RETURNS NULL ON NULL INPUT option, so that the function returns null when any of its inputs are null. This allows us to avoid writing any special null handling code inside the function.
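
For reference, the registration of RegExMatch() might look like the following; the assembly name ValidationFunctions is a placeholder:

CREATE FUNCTION dbo.RegExMatch
    (@pattern nvarchar(4000), @matchString nvarchar(4000))
RETURNS bit
WITH RETURNS NULL ON NULL INPUT
AS EXTERNAL NAME ValidationFunctions.Validation.RegExMatch;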

We can now define constraints on columns of a table that use the above code to validate e-mail addresses and phone numbers as follows:

CREATE TABLE contacts 
(
    firstName nvarchar(30),
    lastName  nvarchar(30),
    emailAddress nvarchar(30) CHECK 
    (dbo.RegExMatch('[a-zA-Z0-9_\-]+@([a-zA-Z0-9_\-]+\.)+(com|org|edu)', 
        emailAddress) = 1),
    usPhoneNo nvarchar(30) CHECK 
    (dbo.RegExMatch(
        '\([1-9][0-9][0-9]\) [0-9][0-9][0-9]\-[0-9][0-9][0-9][0-9]', 
        usPhoneNo)=1),
    areaCode AS dbo.ExtractAreaCode(UsPhoneNo) PERSISTED 
)

Note that the areaCode column is a persisted computed column that extracts the area code from the usPhoneNo column using the ExtractAreaCode() function. The areaCode column could be indexed, facilitating queries against the table that look for contacts by area code.
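
For example, the following index supports efficient lookups by area code; the index name is illustrative:

CREATE INDEX ix_contacts_areaCode ON contacts(areaCode);

SELECT firstName, lastName
FROM contacts
WHERE areaCode = N'425';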

More generally, this example demonstrates how one can leverage the .NET Framework libraries to augment the T-SQL built-in function library with useful functions that are hard to express in T-SQL.

Producing Result Sets

The need to produce result sets from a database object (such as a stored procedure or a view) running inside the server is one of the most common database programming tasks. If the result set can be built using a single query, this can be achieved by simply using a view or an inline table-valued function. However, if multiple statements and procedural logic are needed to build the result set, there are two options: stored procedures and table-valued functions. SQL Server 2000 supports table-valued functions, but they can be written only in T-SQL; with CLR integration in SQL Server 2005, these functions can also be written in a managed language. In this section we look at how to write stored procedures and table-valued functions using the CLR.

From T-SQL, it is possible to return relational results either as the return value of a table-valued function or through the ever-present implicit "caller's pipe" within a stored procedure: from anywhere within a stored procedure—regardless of the nesting of execution levels—an executed SELECT statement will return results to the caller. More precisely, this is true of SELECT statements that do not perform variable assignment. FETCH, READTEXT, PRINT, and RAISERROR statements also implicitly return results to the caller.

Note that "the caller" hasn't been properly defined; it will depend on the invocation context of a stored procedure.

If a stored procedure is invoked from any of the client data access APIs (such as ODBC, OLEDB, or SQLClient), the caller is the actual API and whichever abstraction it provides to represent results (e.g., hstmt, IRowset, or SqlDataReader). This means that, in general, results produced from within a stored procedure will be returned all the way back to the invoking API, bypassing all T-SQL frames on the stack, as in the following example:

CREATE PROC proc1 AS
   SELECT col1 FROM dbo.table1;
CREATE PROC proc2 AS 
   EXEC proc1;

Upon execution of procedure proc2, the results produced by proc1 will go to the caller of proc2. There is only one way in which proc2 can capture the results produced: by using INSERT...EXEC to stream the results into a permanent table, a temporary table, or a table variable.

CREATE PROC proc2 AS
   DECLARE @t TABLE(col1 INT);
   INSERT @t (col1) EXEC proc1;
   -- do something with results

In the case of INSERT...EXEC, the caller is the target table or view of the INSERT statement.

SQL Server 2005 CLR stored procedures introduce a new type of caller. When a query is executed using the ADO.NET provider from within a managed frame, the results are made available through the SqlDataReader object and can be consumed within the stored procedure.

In Visual Basic .NET:

...
Using conn As New SqlConnection("context connection = true")
    conn.Open()
    Dim cmd As SqlCommand = new SqlCommand( _
        "SELECT col1 FROM dbo.table1", conn)
    Dim reader As SqlDataReader = cmd.ExecuteReader()
           
    Do While reader.Read()
        ' Do something with current row
    Loop       
End Using
...

In C#:

...
using (SqlConnection conn = new SqlConnection("context connection = true"))
{
    ...
    SqlCommand cmd = new SqlCommand(
        "SELECT col1 FROM dbo.table1", conn);
    SqlDataReader reader = cmd.ExecuteReader();
    while (reader.Read())
    {
        // do something with current row
    }
    ...
}
...

The remaining issue is how a managed routine returns its results to the caller. This is done differently in CLR stored procedures and table-valued functions. A stored procedure uses a static instance of SqlPipe available from the SqlContext class to send data back, while a table-valued function implements an interface which allows SQL Server to retrieve the results. Both of these are discussed next.

CLR Stored Procedures and SqlPipe

Of the methods available in the SqlPipe class, the simplest is ExecuteAndSend(), which takes a command object as an argument. This method executes the command, but instead of making the results of the execution available to the managed frame, the results are sent to the invoker of the stored procedure. This is semantically equivalent to embedding a statement inside a T-SQL stored procedure, and, while clumsier, it is on par with the T-SQL equivalent in terms of performance.

A simple stored procedure to execute a SELECT in T-SQL:

CREATE PROC proc1 AS
   SELECT col1 FROM dbo.table1;

The equivalent in C# would be:

...
[Microsoft.SqlServer.Server.SqlProcedure]
public static void proc1()
{
    using (SqlConnection conn =
        new SqlConnection("context connection = true"))
    {
        conn.Open();
        SqlCommand cmd = new SqlCommand(
            "SELECT col1 FROM dbo.table1", conn);
        SqlContext.Pipe.ExecuteAndSend(cmd);
        conn.Close();
    }
}
...

And in Visual Basic .NET:

...
<Microsoft.SqlServer.Server.SqlProcedure()> _
Public Shared Sub VBproc1()
    Using conn As New SqlConnection("context connection=true")
        conn.Open()
        Dim cmd As SqlCommand = new SqlCommand( _
            "SELECT col1 FROM dbo.table1", conn)
        SqlContext.Pipe.ExecuteAndSend(cmd)
        conn.Close()
    End Using
End Sub
...

SqlPipe.ExecuteAndSend() works well for scenarios where the data to be returned is produced directly by a query being executed. However, there may be cases in which it is desirable either to manipulate the data before sending it, or to send data which was obtained from sources outside the local SQL Server instance.

SqlPipe provides a group of methods that work together to enable applications to return arbitrary results to the caller: SendResultsStart(), SendResultsRow(), and SendResultsEnd(). For the most part, these APIs are similar to the srv_describe and srv_sendrow APIs available for extended stored procedures.

SendResultsStart() takes a SqlDataRecord as an argument and indicates the beginning of a new result set. This API reads the metadata information from the record object and sends it to the caller. Rows can subsequently be returned by invoking SendResultsRow() once for each row that is to be sent. After all of the desired rows have been sent, a call to SendResultsEnd() is required to indicate the end of the result set.

Example: Returning an RSS Feed in a CLR Stored Procedure

The following C# code fragment represents a portion of a stored procedure that reads an XML document—a Really Simple Syndication (RSS) feed from MSDN—from the web, uses System.Xml classes to parse it, and returns the information in relational form. Note that the code must be deployed in an EXTERNAL_ACCESS or UNSAFE assembly because the Code Access Security (CAS) permissions that are required to access the Internet are available only in these permission sets.

...
using (SqlConnection conn = 
    new SqlConnection("context connection = true"))
{
    // Retrieve the RSS feed
    XPathDocument doc = new
        XPathDocument("https://msdn.microsoft.com/sql/rss.xml");
    XPathNavigator nav = doc.CreateNavigator();
    XPathNodeIterator i = nav.Select("//item");

    // create metadata for four columns
    // three of them are string types and one of them is a datetime
    SqlMetaData[] rss_results = new SqlMetaData[4];
    rss_results[0] = new SqlMetaData("Title", SqlDbType.NVarChar, 250);
    rss_results[1] = new SqlMetaData("Publication Date", 
        SqlDbType.DateTime);
    rss_results[2] = new SqlMetaData("Description", 
        SqlDbType.NVarChar, 2000);
    rss_results[3] = new SqlMetaData("Link", SqlDbType.NVarChar, 1000);

    // construct the record which holds metadata and data buffers
    SqlDataRecord record = new SqlDataRecord(rss_results);

    // cache the SqlPipe instance to avoid repeated calls to
    // SqlContext.Pipe
    SqlPipe sqlpipe = SqlContext.Pipe;
 
    // send the metadata, do not send the values in the data record
    sqlpipe.SendResultsStart(record);

    // for each xml node returned, extract four pieces 
    // of information and send back each item as a row
    while (i.MoveNext())
    {
        record.SetString(0, (string)
            i.Current.Evaluate("string(title[1]/text())"));
        record.SetDateTime(1, DateTime.Parse((string)
            i.Current.Evaluate("string(pubDate[1]/text())")));
        record.SetString(2, (string)
            i.Current.Evaluate("string(description[1]/text())"));
        record.SetString(3, (string)
            i.Current.Evaluate("string(link[1]/text())"));
        sqlpipe.SendResultsRow(record);
    }

    // signal end of results
    sqlpipe.SendResultsEnd();
}
...

Note that between calls to SendResultsStart() and SendResultsEnd(), the SqlPipe is set to a busy state, and invoking any Send method other than SendResultsRow() will result in an error. The SendingResults property is set to true while the SqlPipe is in this busy state.

Table-Valued Functions

CLR integration also enables support for table-valued functions (TVFs) written in managed languages. Similar to the T-SQL equivalent, CLR TVFs are primarily used to return tabular results. The most notable difference is that T-SQL table-valued functions temporarily store results in a work table, whereas CLR TVFs are capable of streaming the results produced, meaning that results do not have to be fully materialized before returning from the function.

Note that although T-SQL also has the notion of inline TVFs which do not temporarily store results, inline T-SQL TVFs are for the most part syntactic sugar to specify sub-queries, possibly with parameters.

Managed TVFs return the standard IEnumerable interface. This interface provides an IEnumerator over objects representing the rows of the table; the query processor retrieves these objects one by one until the enumerator's MoveNext() method returns false.

These objects are opaque to SQL Server, and must be cracked open by a separate fill-row method. The name of this method and the schema of the returned table are defined in the annotation on the TVF. The fill-row method takes an object as input and returns the fields of the row through output parameters, which implies that its signature is not fixed: it must match the declared schema.

Example: A Table-Valued Function to Retrieve an RSS Feed

Here we present the RSS retrieval recast as a table-valued function in C#. Note the correspondence between the SqlFunction annotation on the RSS_TVF() method and the signature of the FillTVFRow() method.

using System;
using System.Data;
using System.Data.Sql;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Xml.XPath;
using System.Collections;

public partial class UserDefinedFunctions
{
    [SqlFunction(FillRowMethodName = "FillTVFRow", 
        TableDefinition = "Title nvarchar(250), " +
            "PublicationDate datetime, " +
            "Description nvarchar(2000), " +
            "Link nvarchar(1000)")
    ]
    public static IEnumerable RSS_TVF()
    {
        return new RssReader("https://msdn.microsoft.com/sql/rss.xml");
    }

    public static void FillTVFRow(object row, out SqlString str, 
        out SqlDateTime date, out SqlString desc, out SqlString link)
    {
        // split each object array
        object[] rowarr = (object[])row;
        str = (SqlString)(rowarr[0]);
        date = (SqlDateTime)(rowarr[1]);
        desc = (SqlString)(rowarr[2]);
        link = (SqlString)(rowarr[3]);
    }
}

public class RssReader : IEnumerable
{
    XPathDocument doc;
    XPathNavigator nav;

    // Construct the helper class: retrieve the RSS feed from the
    // given site and create a navigator over it.
    public RssReader(string site)
    {
        // Retrieve the RSS feed
        doc = new XPathDocument(site);
        nav = doc.CreateNavigator();
    }

    public IEnumerator GetEnumerator()
    {
        return new RSSEnumerator(this);
    }

    private class RSSEnumerator : IEnumerator
    {
        XPathNodeIterator i;
        Object[] current;
        RssReader reader;

        public RSSEnumerator(RssReader reader)
        {
            this.reader = reader;
            Reset();
        }

        public void Reset()
        {
            i = reader.nav.Select("//item");
        }

        public bool MoveNext()
        {
            if (i.MoveNext())
            {
                current = new Object[4];
                current[0] = new SqlString((string)
                    i.Current.Evaluate("string(title[1]/text())"));
                current[1] = new SqlDateTime(DateTime.Parse((string)
                    i.Current.Evaluate("string(pubDate[1]/text())")));
                current[2] = new SqlString((string)
                    i.Current.Evaluate("string(description[1]/text())"));
                current[3] = new SqlString((string)
                i.Current.Evaluate("string(link[1]/text())"));
                return true;
            }
            else return false;

        }

        public Object Current
        {
            get
            {
                return current;
            }
        }
    }
}
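
Before the function can be queried, it must be registered against the uploaded assembly. The registration might look like this, with RssAssembly as a placeholder assembly name:

CREATE FUNCTION dbo.RSS_TVF()
RETURNS TABLE
(
    Title nvarchar(250),
    PublicationDate datetime,
    Description nvarchar(2000),
    Link nvarchar(1000)
)
AS EXTERNAL NAME RssAssembly.UserDefinedFunctions.RSS_TVF;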

A simple query to consume the results from this table-valued function would look like the following:

SELECT *
FROM dbo.RSS_TVF()

Naturally, richer queries can be expressed over the TVF form of this data. Assuming we have a CanonicalURL() function that returns the canonical version of a URL, the data from the RSS feed can easily be returned using canonical URLs:

SELECT title, publicationDate, description, dbo.CanonicalURL(link)
FROM dbo.RSS_TVF()
ORDER BY publicationDate

Note that in this example, we are not leveraging the streaming capabilities of the TVF because we are consuming the whole RSS feed, building a navigator on top, and then iterating over the individual items as calls to MoveNext() are made. However, it is possible to consume the results from a Web source using a streaming API, and iterate over the produced XML with an XmlReader. It is important to note that given the execution model difference between table-valued functions in the CLR and those in T-SQL, a major performance difference may be observed in favor of CLR TVFs, especially for scenarios in which it is possible to stream the results.

Example: Cracking Scalars into Rows

There is often a need to pass multi-valued arguments in an application. For example, an order processing system may need a stored procedure that inserts an order into a table of orders. A desired argument of the stored procedure is likely to be the line-items in the order; however, this runs into the limitation that T-SQL does not support table-valued arguments, and lacks collections and arrays.

One way around this is to encode the collection as a scalar value—as an nvarchar or xml, for example—and pass it as an argument to the stored procedure. The stored procedure can use a table-valued function that takes the scalar input and converts it into a set of rows, which can then be inserted into a table of line-items or otherwise manipulated.

While the table-valued function can be written in T-SQL, it will have better performance if written in the CLR. Because it can take advantage of the string manipulation functions in the .NET Framework (such as String.Split), it is also exceedingly simple to implement.

The following shows the implementation of a table-valued function that takes a semicolon-separated input string, and returns the pieces in the form of a set of rows.

In Visual Basic .NET:

Imports System.Collections
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server

Partial Public Class UserDefinedFunctions

    ' This needs to return an IEnumerable, but since an array
    ' does, in this case we do not need to define a new class
    ' that implements it: we can simply return the array.
    <SqlFunction(FillRowMethodName:="FillRow", _
        TableDefinition:="value nvarchar(60)")> _
    Public Shared Function GetStrings(ByVal str As SqlString) _
        As IEnumerable

        Return str.Value.Split(";"c)
    End Function

    ' This method does the decoding of the row object. Since the
    ' row is already a string, this method is trivial. Note that
    ' this method is pointed to by the annotation on the
    ' GetStrings method.
    Public Shared Sub FillRow(ByVal row As Object, _
        ByRef str As String)

        str = CType(row, String)
    End Sub

End Class

In C#:

using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    /*
     * This needs to return an IEnumerable, but since an array
     * does, in this case we do not need to define a new class
     * that implements it: we can simply return the array.
     */
    [SqlFunction(FillRowMethodName="FillRow",
        TableDefinition="value nvarchar(60)")]
    public static IEnumerable GetStrings(SqlString str)
    {
        return str.Value.Split(';');
    }

    /*
     * This method does the decoding of the row object. Since the
     * row is already a string, this method is trivial. Note that
     * this method is pointed to by the annotation on the
     * GetStrings method.
     */
    public static void FillRow(object row, out string str)
    {
        str = (string)row;
    }
}
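
The registration of this TVF might look like the following; the assembly name StringUtilities is a placeholder:

CREATE FUNCTION dbo.GetStrings(@str nvarchar(4000))
RETURNS TABLE (value nvarchar(60))
AS EXTERNAL NAME StringUtilities.UserDefinedFunctions.GetStrings;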

If we assume that the GetStrings() method is registered as a TVF of the same name, then the following T-SQL fragment shows a stored procedure that uses the TVF to extract the line items of an order in tabular form.

CREATE PROCEDURE Insert_Order @cust_id int, @lineitems nvarchar(8000)
AS
BEGIN
    ...
    INSERT LineItems
    SELECT * FROM dbo.GetStrings(@lineitems)
    ...
END

Which One to Use?

The decision whether to use a stored procedure in conjunction with the SqlPipe—either implicitly in T-SQL or explicitly in a CLR routine—or a table-valued function depends on several factors: composability requirements, the source of the data, the need for side effects, and the typing requirements for the results. We discuss each of these in turn.

Composability Requirements

It may desirable to reuse or further manipulate results produced inside a TVF or a stored procedure. Table-valued functions are more versatile from a composability perspective, as the return type of a TVF is a relational rowset that can be used in any place where such a construct is allowed. In particular, it can be used in the FROM clause of SELECT statements, and as such the results produced can benefit from the composability of SELECT in sub-queries, INSERT...SELECT statements, derived tables, table expressions, etc.

On the other hand, from within the T-SQL language, stored procedures can be composed only as part of the INSERT...EXEC combination that allows produced results to be stored in a permanent or temporary table. The INSERT operation represents an actual copy of the data, which will likely have a performance impact.

If composability and reuse of results from within the server is a requirement, TVFs are a better alternative. If the results produced need to only be streamed back to the client- or middle-tier, either approach is reasonable.

Source of the Data

The source of the data being returned is another important factor in deciding between T-SQL- and CLR-based implementations. Results can be produced either by reading from some source in the local instance using the ADO.NET provider, or from a source external to SQL Server. For external sources, a CLR based implementation is a better choice than T-SQL because of the ease with which the logic accessing the external data can be implemented.

In the case of generating results based on a query executed on the local instance using the ADO.NET provider, a stored procedure would generally execute a query, iterate through the result, and perform some operation on the rows before sending them back through a SqlPipe.

With a TVF, one would expect to wrap a SqlDataReader in a custom implementation of IEnumerable that transforms the results as they are read. However, SQL Server 2005 does not allow requests to be pending when a table-valued function returns: any query executed through the ADO.NET provider must be fully executed and its results completely consumed before the function body returns. An error is raised if the return statement executes while SqlDataReader operations from the ADO.NET provider are pending. This implies that in most cases where data is being returned from the local database instance, it cannot be streamed through a CLR TVF. If other factors, such as composability, require this to be written as a TVF, writing it in T-SQL may be the only option. Otherwise, a managed stored procedure using SqlPipe is an option.

For the case in which results are to be produced from within a stored procedure based on data from the local instance, use of the SendResults APIs makes sense only when the results require some procedural modification or manipulation. If the results are to be sent to the caller unmodified, SqlPipe.ExecuteAndSend() should be used, as it has better performance.
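
The unmodified case might look like the following C# sketch (the procedure name is invented; the table and column names follow the earlier examples):

using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public partial class StoredProcedures
{
    [SqlProcedure]
    public static void SendRowsUnmodified()
    {
        using (SqlConnection conn =
            new SqlConnection("context connection = true"))
        {
            conn.Open();
            SqlCommand cmd = new SqlCommand(
                "SELECT intcolumn FROM tbl", conn);

            // Execute the command and stream the results directly to
            // the caller without touching them in managed code.
            SqlContext.Pipe.ExecuteAndSend(cmd);
        }
    }
}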

Operations with Side-Effects

In general, operations which produce side-effects—operations which change the state of the database, such as DML statements or transaction operations—are disallowed from user-defined functions, including table-valued functions. These operations may be desired, however. For example, one may wish to set a transaction savepoint, perform some operation, and roll back to the savepoint in the case of an error.
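
In T-SQL terms, that pattern looks roughly like the following sketch (the table and savepoint names are illustrative):

BEGIN TRANSACTION
SAVE TRANSACTION before_change

UPDATE tbl SET intcolumn = intcolumn + 1

-- On error, undo only the work done since the savepoint.
IF @@ERROR <> 0
    ROLLBACK TRANSACTION before_change

COMMIT TRANSACTION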

Given that side-effects are disallowed from user-defined functions, such a scenario could only be implemented through a stored procedure, and results would have to be returned through SqlPipe. Note, however, that operations with side-effects are not allowed to execute through the ADO.NET provider while the SqlPipe is busy sending results. These operations are allowed only before the result set has been started or after it has been completed.

Typing and Number of Results

The description of results produced by a CLR stored procedure through SqlPipe differs from that of a CLR TVF, consistent with their counterparts in T-SQL. A TVF is strongly typed, and as part of the registration (CREATE FUNCTION) statement, it must statically define the type of its return value.

On the other hand, a stored procedure declaration makes no statement about the results it produces—or even whether it produces any. This may seem a convenience, and it certainly provides greater flexibility, but greater care must be taken when writing applications that execute stored procedures, since a procedure can dynamically redefine the shape of its results. If the schema of the results needs to vary across invocations, a stored procedure should be used, since only SqlPipe provides this flexibility.

In fact, the weak typing of results produced through SqlPipe inside stored procedures goes beyond the schema of a single result set and includes the possibility of returning a variable number of result sets. Both the types and the number of result sets can be determined dynamically by the stored procedure, as the following sketch illustrates.
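
A hedged C# sketch of a procedure that chooses the shape of its result set at run time through the SendResults APIs (the procedure and column names are invented):

using System.Data;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class StoredProcedures
{
    [SqlProcedure]
    public static void DynamicShape(SqlBoolean includeName)
    {
        // Choose the result schema dynamically.
        SqlMetaData[] columns = includeName.IsTrue
            ? new SqlMetaData[] {
                new SqlMetaData("id", SqlDbType.Int),
                new SqlMetaData("name", SqlDbType.NVarChar, 60) }
            : new SqlMetaData[] {
                new SqlMetaData("id", SqlDbType.Int) };

        SqlDataRecord record = new SqlDataRecord(columns);
        record.SetInt32(0, 1);
        if (includeName.IsTrue)
            record.SetString(1, "example");

        // Start the result set, send one row, and close it.
        SqlContext.Pipe.SendResultsStart(record);
        SqlContext.Pipe.SendResultsRow(record);
        SqlContext.Pipe.SendResultsEnd();
    }
}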

Summary

The following table summarizes the guidelines on how to choose whether a particular application should be written in T-SQL or the CLR, and whether a stored procedure or table-valued function should be used.

Condition | Yes | No
Composability required? | TVF | Procedure or TVF
External data source (vs. accessing only local data)? | CLR TVF or CLR procedure | T-SQL TVF or procedure (accessing only local data)
Side effects required? | Procedure | Procedure or TVF
Fixed results schema? | Procedure or TVF | Procedure
More than one result set? | Procedure | Procedure or TVF
Ability to stream results? | CLR TVF | T-SQL TVF

For most of this section, sending results through the SqlPipe has been tightly associated with stored procedures. Even though the SqlPipe, and with it the ability to return results, is also available in the body of CLR triggers, this practice is strongly discouraged: it can lead to unexpected results for applications that issue Data Manipulation Language or Data Definition Language statements against objects that have triggers defined on them.

Performing Custom Aggregations over Data

There are a number of scenarios where aggregation may need to be performed on data, including performing statistical calculations, such as finding averages, standard deviations, etc. If the desired aggregation function is not built in, there are several ways to add the functionality in SQL Server 2005:

  • Write the aggregation as a user-defined aggregate (UDA).
  • Write the aggregate using a CLR stored procedure.
  • Use a server-side cursor in T-SQL.

Let us examine the three alternatives in the context of a simple aggregation function which calculates the product of a given set of values.

Example: Product Implemented as a User-Defined Aggregate Function

Here is the code for this task written as a user-defined aggregate. The logic for calculating the product as each value arrives is in the Accumulate() method. The Merge() method defines what happens when two partial aggregates are merged.

In Visual Basic .NET:

Imports System
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server

<Serializable()> _
<SqlUserDefinedAggregate(Format.Native)> _
Public Structure ProductAgg

    Private product As SqlInt32

    Public Sub Init()
        product = 1
    End Sub

    Public Sub Accumulate(ByVal value As SqlInt32)
        product = product * value
    End Sub

    Public Sub Merge(ByVal group As ProductAgg)
        product = product * group.product
    End Sub

    Public Function Terminate() As SqlInt32
        Return product
    End Function

End Structure

In C#:

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

[Serializable]
[SqlUserDefinedAggregate(Format.Native)]
public struct ProductAgg
{
    private SqlInt32 product;

    public void Init()
    {
        product = 1;
    }

    public void Accumulate(SqlInt32 value)
    {
        product = product * value;
    }

    public void Merge(ProductAgg group)
    {
        product = product * group.product;
    }

    public SqlInt32 Terminate()
    {
        return product;
    }
}
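
Registration might look like the following sketch (the assembly name ProductAssembly is hypothetical):

CREATE AGGREGATE ProductAgg(@value int)
RETURNS int
EXTERNAL NAME ProductAssembly.ProductAgg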

After this type is built and registered with SQL Server, it can be used from T-SQL just like a built-in aggregate:

SELECT dbo.ProductAgg(intcol) 
FROM tbl
GROUP BY col

Example: Product as a Managed Stored Procedure

A stored procedure can be created that iterates over the data to perform the computation. This iteration is accomplished using the SqlDataReader class as shown below.

In Visual Basic .NET:

Imports Microsoft.SqlServer.Server
Imports System.Data.SqlTypes
Imports System.Data.SqlClient

Partial Public Class StoredProcedures

    <SqlProcedure()> _
    Public Shared Sub VBProductProc(ByRef value As SqlInt32)

        ' The empty product is 1
        value = 1

        Using conn As New SqlConnection("context connection = true")
            conn.Open()
            Dim cmd As SqlCommand = New SqlCommand()
            cmd.Connection = conn
            cmd.CommandText = "SELECT intcolumn FROM tbl"
            Dim r As SqlDataReader = cmd.ExecuteReader()
            Using r
                Do While r.Read()
                    value = value * r.GetSqlInt32(0)
                Loop
            End Using
            conn.Close()
        End Using
    End Sub

End Class

In C#:

using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
using System.Data.SqlClient;
public partial class StoredProcedures
{
    [SqlProcedure]
    public static void ProductProc(out SqlInt32 value)
    {
        // Ensure that we write to value.
        // Empty product is 1.
        value = 1;
        
        using (SqlConnection conn = 
            new SqlConnection("context connection = true"))
        {
            conn.Open();
            SqlCommand cmd = new SqlCommand();
            cmd.Connection = conn;
            cmd.CommandText = "SELECT intcolumn FROM tbl";
            SqlDataReader r = cmd.ExecuteReader();
            using (r)
            {
                while (r.Read()) // advance to the next row
                {
                    value *= r.GetSqlInt32(0);
                }
            }
        }
    }
}

This can then be invoked using the T-SQL EXEC statement (assuming the C# method is registered under the name ProductProc):

DECLARE @p int
EXEC ProductProc @p OUTPUT

Example: Product as a T-SQL Stored Procedure that Uses a Cursor

Finally, a T-SQL stored procedure can be created that executes a query and performs the calculation using a T-SQL cursor to iterate over the data.

T-SQL:

create procedure TSQL_ProductProc (@product int output)
as
begin
    declare @sales int
    declare c insensitive cursor for select intcolumn from tbl

    set @product = 1

    open c    
    fetch next from c into @sales
   
    while @@FETCH_STATUS = 0
    begin
        set @product = @product * @sales
        fetch next from c into @sales
    end
   
    close c
    deallocate c
end
go

Summary

The decision of whether to use a UDA or one of the other solutions to produce results depends on several factors, including the composability requirements, specifics of the aggregation algorithm, and the need for side-effects.

A UDA is a standalone object that can be used from any T-SQL query, generally in the same places that a system aggregate can be used; it makes no assumptions about the query it operates on. For example, it can be included in view definitions (although not in indexed views) and in scalar sub-queries.

UDAs may be evaluated before an ORDER BY clause of a query, so there are no guarantees about the order in which values are presented to the aggregation function. Therefore, if the aggregation algorithm must consume values in a particular order, a UDA cannot be used. Similarly, a UDA consumes the values from an entire group and returns a single value. If this does not fit with the problem then another technique must be used.

Also, a UDA can perform no data access, nor have side-effects; if either of these are necessary then a stored procedure should be used.

Although UDAs have a number of restrictions, they are likely to provide the best performance of the options presented, so aggregation should generally be performed through a UDA unless other requirements prevent it.

User Defined Types

Now we come to one of the more powerful, but often misunderstood, features of SQL Server 2005. With user-defined types (UDTs), one can extend the scalar type system of the database. This goes beyond just defining an alias for a system type, which has been available in previous releases of SQL Server. Defining a UDT is as simple as writing a class in managed code, creating an assembly, and then registering the type in SQL Server by using the CREATE TYPE statement. The following code skeleton illustrates the contract that a UDT must satisfy:

In Visual Basic .NET:

<Serializable()> _
<Microsoft.SqlServer.Server.SqlUserDefinedType(Format.Native)> _
Public Structure SimpleType
    Implements INullable

    Public Overrides Function ToString() As String
        ...
    End Function

    Public ReadOnly Property IsNull() As Boolean Implements _
        INullable.IsNull
        ...
    End Property

    Public Shared ReadOnly Property Null() As SimpleType
        ...
    End Property

    Public Shared Function Parse(ByVal s As SqlString) As SimpleType
        ...
    End Function

End Structure

In C#:

[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedType(Format.Native)]
public struct SimpleUdt : INullable
{
    public override string ToString() { ...  }
    public bool IsNull { ...  }
    public static SimpleUdt Null
    {
        get { ...  }
    }
    public static SimpleUdt Parse(SqlString s) { ...  }
}

This can then be loaded and used in T-SQL:

CREATE TYPE simpleudt EXTERNAL NAME [myassembly].[SimpleUdt]
CREATE TABLE t (mycolumn simpleudt)

When to Create a UDT

UDTs in SQL Server 2005 are not an object-relational extensibility mechanism; they are a way to extend the scalar type system of the database. The scalar type system includes the columnar types that ship with SQL Server (e.g., int, nvarchar, uniqueidentifier, etc.). With UDTs, a new type can be defined that can be used in place of a built-in scalar type. Create a UDT if the type is an atomic value that is appropriate to be modeled as a column.

Good candidates for implementation as a UDT include custom date or time data types in various calendars, and currency data types. A UDT is a single class that exposes all behaviors available on the type and encapsulates the underlying data stored by the type: all data access uses the programmatic interface of the UDT. Often one can leverage existing functionality in the .NET Framework—such as the internationalization or calendar functionality—to provide functionality that would be hard to provide otherwise.

When Not to Create a UDT

A UDT should not be used to model complex business objects, such as employees, contacts, or customers. SQL Server treats a UDT as a single, opaque unit. Some issues with complex UDTs include the 8 KB size limitation on types, indexing limitations, and the fact that the entire value must be updated when any field of the UDT is updated.

Even in cases where the type is relatively simple, such as a point or color class, it is better to use a mid-tier object relational mapping technology rather than a UDT. UDTs should generally be saved for cases where the data is truly atomic.

Factors to Consider When Designing a UDT

Since UDTs are columnar, indexes can be defined over entire UDT values, as can referential integrity constraints, such as uniqueness. UDTs can also be used in comparison and ordering scenarios.

Comparing UDT values is accomplished by comparing the underlying binary representation of the type. If Format.Native is used as the persistence mechanism, the persisted form is created using the same field order as defined in the type, so care should be taken to ensure that the fields are laid out in the correct order, as in the sketch below.
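
For example, in this hedged C# sketch of an invented two-field type, the most significant field is declared first so that byte-ordered comparisons behave correctly:

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

[Serializable]
[SqlUserDefinedType(Format.Native, IsByteOrdered = true)]
public struct VersionNumber : INullable
{
    // Declared most significant first: comparisons on the persisted
    // form compare major before minor.
    private int major;
    private int minor;
    private bool isNull;

    public bool IsNull
    {
        get { return isNull; }
    }

    public static VersionNumber Null
    {
        get
        {
            VersionNumber v = new VersionNumber();
            v.isNull = true;
            return v;
        }
    }

    public override string ToString()
    {
        return major + "." + minor;
    }

    public static VersionNumber Parse(SqlString s)
    {
        if (s.IsNull) return Null;
        string[] parts = s.Value.Split('.');
        VersionNumber v = new VersionNumber();
        v.major = int.Parse(parts[0]);
        v.minor = int.Parse(parts[1]);
        return v;
    }
}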

Except for comparison, every operation on a UDT requires that the UDT value be de-serialized and a method invoked. This pattern has a cost associated with it, which should be accounted for when assessing whether to model a type as a UDT. UDTs are best used when there are complex behaviors of the type which need to be modeled. If the type is relatively simple then it may be better to avoid the UDT construct.

Finally, one can use static methods of a UDT as a convenient packaging mechanism to store a library of related functions. Static methods can be invoked from T-SQL using the following syntax:

select Type::Function(@arg1)

Example: Non-Western Dates

We would like to store date and time values using the Um Al Qura calendar, which is different from the Gregorian calendar used by the SQL Server datetime data type. We would like this data type to have the same basic set of behaviors, namely string conversion, the ability to retrieve date parts, perform date arithmetic, etc.

The following example of a user-defined type is a simple implementation of this data type. It uses the UmAlQuraCalendar type, which is new in the .NET Framework version 2.0. It would be straightforward to augment this example to provide the full complement of methods needed.

Um Al Qura UDT in C#:

using System;
using System.Data;
using System.Data.Sql;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Globalization;

[Serializable]
[SqlUserDefinedType(Format.Native, IsByteOrdered = true)]
public struct UmAlQuraDateTime : INullable
{

    /*
     * Private state.
     */ 
    
    private long dtTicks;
    private bool isNull;

    // Calendar object used for all calendar-specific operations
    private static readonly UmAlQuraCalendar s_calendar = 
        new UmAlQuraCalendar();

    // For correct formatting we need to provide a culture code for
    // a country that uses the Um Al Qura calendar: Saudi Arabia.
    private static readonly CultureInfo ci = 
        new CultureInfo("ar-SA", false);
    

    /*
     * Null-Handling
     */

    // get a null instance
    public static UmAlQuraDateTime Null
    {
        get
        {
            UmAlQuraDateTime dt = new UmAlQuraDateTime();
            dt.isNull = true;
            return dt;
        }
    }

    public bool IsNull
    {
        get
        {
            return this.isNull;
        }
    }

    /*
     * Constructors
     */

    public UmAlQuraDateTime(long ticks)
    {
        isNull = false;
        dtTicks = ticks;
    }

    public UmAlQuraDateTime(DateTime time) : this(time.Ticks)
    {
    }

    /*
     * Factory routines.
     */

    public static UmAlQuraDateTime Parse(SqlString s)
    {
        if (s.IsNull) return Null;
        DateTime t = DateTime.Parse(s.Value);
        return new UmAlQuraDateTime(t);
    }

    public static UmAlQuraDateTime ParseArabic(SqlString s)
    {
        if (s.IsNull) return Null;
        DateTime t = DateTime.Parse(s.Value, ci);
        return new UmAlQuraDateTime(t);
    }

    public static UmAlQuraDateTime FromSqlDateTime(SqlDateTime d)
    {
        if (d.IsNull) return Null;
        return new UmAlQuraDateTime(d.Value);
    }

    public static UmAlQuraDateTime Now
    {
        get
        {
            return new UmAlQuraDateTime(DateTime.Now);
        }
    }

    /*
     * Conversion Routines
     */

    public DateTime DateTime
    {
        get { return new DateTime(this.dtTicks); }
    }

    public SqlDateTime ToSqlDateTime()
    {
        return new SqlDateTime(this.DateTime);
    }

    public override String ToString()
    {
        return this.DateTime.ToString(ci);
    }

    public String ToStringUsingFormat(String format)
    {
        return this.DateTime.ToString(format, ci);
    }

    /*
     * Methods for getting date parts.
     */

    public int Year
    {
        get
        {
            return s_calendar.GetYear(this.DateTime);
        }
    }

    public int Month
    {
        get
        {
            return s_calendar.GetMonth(this.DateTime);
        }
    }

    public int Day
    {
        get
        {
            return s_calendar.GetDayOfMonth(this.DateTime);
        }
    }

    /*
     * Date arithmetic methods.
     */

    public UmAlQuraDateTime AddYears(int years)
    {
        return new 
            UmAlQuraDateTime(s_calendar.AddYears(this.DateTime, years));
    }

    public UmAlQuraDateTime AddDays(int days)
    { 
        return new 
            UmAlQuraDateTime(s_calendar.AddDays(this.DateTime, days));
    }   

    public double DiffDays(UmAlQuraDateTime other)
    {
        TimeSpan diff = this.DateTime.Subtract(other.DateTime);
        return diff.Days;
    }
}
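
Loading the type might look like this sketch (the assembly name and path are hypothetical):

CREATE ASSEMBLY UmAlQura FROM 'C:\assemblies\UmAlQura.dll'
CREATE TYPE UmAlQuraDateTime EXTERNAL NAME UmAlQura.UmAlQuraDateTime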

Once this type is loaded in SQL Server, the type can be used via T-SQL. Here are some T-SQL examples using this UDT along with the results they produce.

First we parse an Um Al Qura date and print it out in two formats along with the western equivalent:

DECLARE @d UmAlQuraDateTime
SET @d = UmAlQuraDateTime::ParseArabic('01/02/1400')
PRINT @d.ToString()
PRINT @d.ToStringUsingFormat('F')
PRINT @d.ToSqlDateTime()

This produces:

[Screenshot of the output omitted.]

We can convert western dates into Um Al Qura as well:

DECLARE @n DateTime
SET @n = 'March 20, 2005'
DECLARE @d UmAlQuraDateTime
SET @d = UmAlQuraDateTime::FromSqlDateTime(@n)
PRINT @n
PRINT @d.ToString()

The result:

[Screenshot of the output omitted.]

Finally, we can create and modify tables with columns of this type:

CREATE TABLE dates (
   western DateTime,
   umalqura UmAlQuraDateTime
)

INSERT INTO dates(western) VALUES ('June 1, 2005')
INSERT INTO dates(western) VALUES ('July 1, 2005')

UPDATE dates 
SET umalqura = UmAlQuraDateTime::FromSqlDateTime(dates.western)

SELECT western, umalqura.ToString() as umalqura FROM dates

This outputs a table:

[Screenshot of the resulting table omitted.]

Conclusion

This paper has presented guidelines, concrete usage scenarios, and samples using CLR integration features in SQL Server 2005. Database application developers and architects should use this paper in conjunction with documentation on other SQL Server 2005 features, such as Transact-SQL, XML, and Service Broker.

In the future, we plan to provide more articles in this area with guidelines addressing manageability, monitoring, and troubleshooting aspects of CLR integration.