Writing Language-Portable Transact-SQL

 

Ken Henderson

June 11, 2004

Summary: Ken discusses the enhanced localization features in SQL Server 2000 that make it much easier to write Transact-SQL code that is portable across languages. (9 printed pages)

 

Introduction

In this month's column, I'll discuss a few of the issues facing developers who wish to write Transact-SQL code that is portable across languages. The introduction of Unicode data types, collations, and various other internationalization and localization features in SQL Server 7.0 has made writing language-portable Transact-SQL code much easier than in previous releases. SQL Server 2000 enhanced the internationalization features in SQL Server 7.0 by adding such things as column-level collations, so the tools available to you as a developer have simply gotten better over time.

The SQL Server Books Online (BOL) contains a brief section on writing language-portable code that you should probably begin with. If you haven't yet read the BOL topic, "Writing International Transact-SQL Statements," I'd recommend that you do so before reading this column.

Dates

Let's begin our discussion of writing portable code by exploring what's involved with handling dates in a portable fashion. Although people usually associate writing portable database code with handling regular character strings differently, writing portable date-handling code is a good way to begin because it basically amounts to being aware of a few simple rules and options.

The first thing I'll mention regarding handling dates in Transact-SQL in a portable manner is to avoid assumptions about date format or the first day of the week. The first of these probably seems pretty obvious—most people know that dates are formatted differently around the world. The second, however, is probably not so obvious. The day that's considered the first day of the week varies between cultures. Take this code, for example:

set language us_english
select datepart (dw, '20060606')

set language british
select datepart (dw, '20060606')

It returns:

Changed language setting to us_english.
            
----------- 
3

Changed language setting to British.
            
----------- 
2

These two queries return different results because the day that's considered the first day of the week differs between the United States and Great Britain. You can check this yourself by querying the system function @@datefirst:

Changed language setting to us_english.
     
---- 
7

Changed language setting to British.
     
---- 
1

You can change which day is considered the first day of the week (and, hence, the value returned by @@datefirst) via the SET DATEFIRST command. The default varies based on culture.

There are two easy ways to write code that uses datepart(dw,...) in a manner that's date agnostic. The first is simply to call SET DATEFIRST at the head of any procedures that make assumptions about the first day of the week. This will override the connection settings for DATEFIRST for the duration of the procedure. The second is to normalize the value returned by @@datefirst such that it's language-independent. Here's some code that shows how to do this:

declare @ndf_dw int

set language us_english
select @ndf_dw = (@@datefirst + datepart(dw, '20060606')) % 7
select @ndf_dw, datepart(dw, '20060606')

set language british
select @ndf_dw = (@@datefirst + datepart(dw, '20060606')) % 7
select @ndf_dw, datepart(dw, '20060606')

If you run this code, you'll see that the value returned by datepart(dw,...) varies based on the current language setting, but the value returned for @ndf_dw is the same.

Date names

The names returned by Transact-SQL date functions for months and days of the week will vary based on the language setting. As pointed on in the Books Online, this means that you should use numeric date parts rather than name strings for month and day of week comparisons. For example:

select DATENAME(dw,'20060606')

returns:

------------------------------ 
Tuesday

when the default language is set to U.S. English, but returns:

------------------------------ 
Mardi

when it is set to French. Obviously, if you write code that relies on a date string for a particular language being returned, your code will likely break when presented with strings returned when the default language is set to some other language. Using numeric date parts alleviates this problem.

Displaying date values

Of course, you'll want to use the names of date parts when rendering data for user consumption or display because these will generally be more meaningful. For basic, two-tiered applications, the simplest option may be to allow SQL Server to return these names for you rather than translating them through some other means in the client application. If a login's language isn't set correctly, this obviously won't work as expected.

For more complex environments involving a middle tier or connection pooling, it may be impractical to assume that a login's language setting will be set correctly for a particular end user. A better option in these scenarios is to return binary date/time values from your T-SQL code that are unambiguous and can therefore be translated to meaningful strings in the client application.

Comparing and storing dates

When specifying dates in DML or comparison statements, be sure to use constants that have the same meaning in all languages. The simplest and most foolproof way to do this from a client application is to submit such statements as RPC events to the server (using the SqlCommand object in managed code, for example) and pass dates using explicit parameters. Parameters passed from a client application encoded in their native format are inherently unambiguous.

Another option is to use ODBC escape clauses to denote specific formats that are consistent across languages. The application need not be an ODBC application because the escape clauses are interpreted by the server. The Books Online topic mentioned earlier, "Writing International Transact-SQL Statements," details the supported escape clauses.

Applications can also use undelimited character strings whose parts are sequenced in most-to-least significance order (i.e., yyyymmdd). SQL Server will interpret dates formatted this way correctly regardless of the language setting.

Another alternative is to use the CONVERT() Transact-SQL function and its style parameter to explicitly specify the format a date is to be converted to or from. This disambiguates date/time strings and is portable across languages.

Unicode

The easiest and most straightforward thing you can do to write Transact-SQL code that is portable between languages is to use Unicode data types to represent and store character data. This means that you should use nchar instead of char, nvarchar instead of varchar, and ntext instead of text. While using Unicode data types limits you somewhat in terms of the length of a string you can store and can be a little slower and a little more cumbersome than using non-Unicode types, this is the simplest way to deal with language portability issues and is the only solution that simply works without additional coding considerations.

Along the same lines, you must be careful to precede Unicode character string constants with a capital letter N prefix as specified in the Books Online topic "Using Unicode Data." As Knowledge Base article 239530, INF: Unicode String Constants in SQL Server Require N Prefix points out, failing to do this causes SQL Server to convert the character string in question to the non-Unicode code page of the current database before using it. Here's an example:

-- Assumes the default code page is not Greek
create table #t1 (c1 nchar(1))

insert #t1 values(N'Ω')
insert #t1 values('Ω')

select * from #t1

This returns:

c1   
---- 
Ω
O

Both inserts attempt to insert a single character, a Greek omega (Ω), into a table. As you can see, there's no data loss with the first row insert. In the second row, however, the Ω has been converted to a different character (a capital O) because the lack of an N prefix in the second INSERT statement caused the string to be converted to the default code page, and Ω does not exist in this code page (capital O is the closest approximation).

Even though the table's column uses a Unicode data type, this does not prevent the data loss. The data loss occurs because the capital N prefix is omitted in the source data and SQL Server converts it to the default code page before inserting it. As with date values, if you submit queries featuring Unicode parameters as RPC events, you alleviate needing to distinguish between Unicode and non-Unicode strings via the N prefix—the RPC parameter itself defines its data type to the server.

Collations

Have a look at the code below and see if you can determine what's wrong with it:

create function hexnum(@hexstr varchar(10))
returns bigint
as
begin
if left(@hexstr,2) in ('0x','0X') set @hexstr=substring(@hexstr,3,10)  -- Lop off 0x prefix
declare @i int, @res bigint, @l int, @c char
select @i=1, @l=len(@hexstr), @res=0
if @hexstr is null OR @l=0 return null
while @i<=@l begin
   set @c=upper(substring(@hexstr,@i,1))
   if @c<'0' OR @c>'F' return(null)
set @res=@res+cast(1.0 as bigint)*case when isnumeric(@c)=1 then
cast(@c as int) else ascii(@c)-55 end*power(16,@l-@i)
   set @i=@i+1
end
return(@res)
end

The obvious purpose of the hexnum() UDF is to translate a hexadecimal string that's passed into the function into its integer value. Do you see what's wrong with the function? I'll give you a hint: it has to do with collation dependency. Let's take a closer look at the line:

if @c<'0' OR @c>'F' return(null)

Whether SQL Server 2000 considers a character less than or greater than another character is collation-dependent: a collation defines the bit patterns that represent each character in a character string and the rules controlling how characters are sorted and compared.

The purpose of the above line of code is to filter out invalid inputs to the function. If a character is not between 0 and 9 or between A and F, it cannot be a valid hexadecimal digit, and therefore should cause the function to fail. The code tries to take advantage of the fact that, in the default ASCII character set, the characters 0 through 9 come immediately before A. The problem with the way it's written is that the check is collation dependent and will fail to work correctly for certain inputs. Take this input, for example:

select dbo.hexnum('0xÈ') 

It returns:

-------------------- 
145

which is obviously incorrect. This is because the character comparison is collation-dependent. The example character happens to fall between 0 and F in the default collation, but is still not a valid hexadecimal digit. A beginner might rewrite this line of code as:

if @c NOT LIKE '[0-9ABCDEF]' return(null)

However, LIKE isn't the optimal solution here. It's not needed and isn't as efficient as simpler techniques. So, how can we write the check such that it's impervious to collation nuances? How about this:

if not ascii(@c) between @ascii0 and @asciiF return(null)

This, of course, necessitates a couple of new variables, @ascii0 and @asciiF. We can assign them easily enough in the SELECT statement earlier in the function:

select @i=1, @l=len(@hexstr), @res=0, @ascii0=ascii('0'), @asciiF=ascii('F')

Here's the revised function in its entirety:

create function hexnum(@hexstr varchar(10))
returns bigint
as
begin
if left(@hexstr,2) in ('0x','0X') set @hexstr=substring(@hexstr,3,10)  -- Lop off 0x prefix
declare @i int, @res bigint, @l int, @c char, @ascii0 int, @asciiF int
select @i=1, @l=len(@hexstr), @res=0, @ascii0=ascii('0'), @asciiF=ascii('F')
if @hexstr is null OR @l=0 return null
while @i<=@l begin
   set @c=upper(substring(@hexstr,@i,1))
   if not ascii(@c) between @ascii0 and @asciiF return(null)
   set @res=@res+cast(1.0 as bigint)*case when isnumeric(@c)=1 then 
cast(@c as int) else ascii(@c)-55 end*power(16,@l-@i)
   set @i=@i+1
end
return(@res)
end

Let's try it again with our test string and see whether it behaves correctly:

SELECT dbo.hexnum('0xÈ') 

This returns:

-------------------- 
NULL

which is what we want. Tests with other inputs seem to confirm that the function works as expected and correctly filters out invalid values.

Let's look at another collation-related coding issue. Let's say you're using a default database collation of Latin1_General_BIN (in order to ensure case-sensitivity) and have a business need to filter queries against a particular table with case-insensitive predicates. You might be tempted to do something like this:

select au_id FROM authors WHERE LOWER(au_lname) = LOWER(@au_lname)

However, you shouldn't assume that the terms uppercase and lowercase have the exact same meaning in every collation and language. In Turkish, an uppercase "i" isn't "I", it's "İ" (note the dot). Even with identical inputs, LOWER(@au_lname) may have different outputs on different servers.

This technique also has performance issues. By wrapping the column in a function, the code interferes with the optimizer's ability to identify indexes to service the query. To work around this, you might be tempted to create a computed column that converts your column to a specific character case and create an index over that computed column, like this:

alter table authors
add au_lname_lower as LOWER(au_lname)

create index foo on authors (au_lname_lower)

select au_id from authors where au_lname_lower=LOWER(@au_lname)

However, adding a new column and index isn't the optimal SQL Server 2000 solution. It's not needed and isn't as efficient as simpler techniques. So, what changes can we make such that the query is impervious to collation nuances? The best solution here is to define the column in question with a case-insensitive collation to begin with, like so:

create table authors 
(...
au_lname varchar(40) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
...)

Any other solution amounts to an attempt to code collation rules in Transact-SQL, something that's unnecessary and, at best, questionable in SQL Server 2000.

The moral of the story here is that traditional notions about character comparisons that have been around as long as there have been programming languages don't necessarily apply to languages such as Transact-SQL that are collation-aware. You have to code differently in collation-sensitive environments, but the upshot is that your code will then be portable across languages.

Conclusion

Writing portable Transact-SQL code is not difficult if you can use Unicode data types and follow a few simple rules. It is becoming increasingly more common for applications to need to be language-portable. If your business needs require that you write Transact-SQL code that runs under multiple languages, SQL Server 2000 provides the tools you need.

The Guru's Guide to SQL Server Architecture and Internals

The Guru's Guide to SQL Server Stored Procedures, XML, and HTML

 

SQL Server for Developers

Ken Henderson is a husband and father living in suburban Dallas, Texas. He is the author of eight books on a variety of technology-related topics, including the recently released The Guru's Guide to SQL Server Architecture and Internals (Addison-Wesley, 2003). An avid Dallas Mavericks fan, Ken spends his spare time watching his kids grow up, playing sports, and gardening.