International Features in Microsoft SQL Server 2005

 

SQL Server User Education
Microsoft Corporation

February 2007
Revised April 2007

Applies to:
   Microsoft SQL Server 2005
   Unicode

Summary: This white paper introduces Microsoft SQL Server developers to the international features of Microsoft SQL Server 2005. Topics covered include an explanation of Unicode, added support for supplementary characters in SQL Server 2005, the changes in collation in different versions of SQL Server, changes in data types, performance, updates on data providers, and new international support features in SQL Server 2005 Analysis Services and Integration Services. (59 printed pages)

Contents

Introduction
Unicode Support in SQL Server 2005
Data Types in SQL Server 2005
Performance and Storage Space
Migration of Metadata Information in System Tables
Collations
Server-Client Communication (Data-Access Technologies)
Multilingual Data in the User Interface
Multilingual Transact-SQL
Locale Support in SQL Server 2005
SQL Server 2005 Integration Services
Using Command-Line Utilities with Multilingual Data
International Features in SQL Server Analysis Services
XML Support in SQL Server 2005
Enhancements to Full-Text Search
Interacting with Other Database Products
Conclusion
Acknowledgments
Appendix A: Collation Suffixes
Appendix B: Locales Supported in Windows
Appendix C: SQL-Specific Sort Orders
Appendix D: Languages Supported in SQL Server 2005
Appendix E: Full-Text Languages Added in SQL Server 2005
Appendix F: Collations Updated in SQL Server 2005

Introduction

Microsoft SQL Server 2005 builds on the Unicode and XML support introduced in SQL Server 2000, and adds a powerful new set of development and query tools with SQL Server Management Studio and Business Intelligence Development Studio. Robust multilingual features make SQL Server 2005 a compelling database product and applications platform for the support of international operations and environments.

This white paper provides an overview of these features in a global context. It lists features related to international and multilingual requirements, and explains how design decisions can affect many aspects of a project.

Note This paper uses the following international fonts to provide examples of some international characters: Arial Unicode MS, Latha, Mangal, PmingLiu, Gulim, SimSun, and MS-Mincho. Not having these fonts installed will not seriously affect the usability of this paper.

Unicode Support in SQL Server 2005

Unicode support is the foundation for multilingual support in SQL Server 2005.

Unicode is a standard that was created by the Unicode Consortium, an organization that promotes a single character set for all languages. SQL Server 2005 supports the Unicode Standard, Version 3.2. Version 3.01 of the Unicode Standard is identical to ISO/IEC 10646, an international standard that matches all of the code points in Unicode.

Unicode works by supplying a unique code point for every character, regardless of platform, program, or language. A program that supports Unicode can handle data in any language. Because it is designed to cover all the characters of all the languages of the world, there is no need for different code pages to handle different sets of characters.

Because all Unicode systems consistently use the same bit patterns to represent all characters, there is no problem with characters being converted incorrectly when moving from one system to another.

The easiest way to manage character data in international databases is always to use the Unicode nchar, nvarchar, and nvarchar(max) data types, instead of their non-Unicode equivalents: char, varchar, and varchar(max). That way, every client will see the same characters in the data. If all the applications that work with international databases also use Unicode variables instead of non-Unicode variables, character translations do not have to be performed anywhere in the system.
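
For example, the following hypothetical table stores a product code that is always ASCII in a non-Unicode column, and a description that may contain any script in a Unicode column. Note the N prefix, which marks a string literal as Unicode:

CREATE TABLE dbo.Products
(
    ProductCode char(10),          -- non-Unicode: limited to a single code page
    Description nvarchar(200)      -- Unicode (UCS-2): can hold any script
);

INSERT INTO dbo.Products VALUES ('A100', N'Ürün açıklaması');  -- the N prefix keeps the literal Unicode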

Note   The ntext data type will be removed in a future version of Microsoft SQL Server.

Unicode code points and the characters they represent are separate from the glyph that is used for visual rendering. A glyph is defined by the ISO standard (ISO/IEC 9541-1) as "a recognizable abstract graphic symbol which is independent of a specific design." Therefore, a character is not necessarily always represented by the same glyph, or even a unique glyph. The typeface that you choose determines what glyph will be used to represent a particular code point or series of code points.

For more information, see the Unicode Consortium Web site.

Encodings

Unicode maps code points to characters, but does not actually specify how the data will be represented in memory, in a database, or on a Web page. This is where the encoding of Unicode data comes into play. There are many different encodings for Unicode. Most of the time you can simply choose a Unicode data type and not worry about these details; however, it is important to understand the encoding when you are in the following situations:

  • When dealing with an application that may encode Unicode differently
  • When sending data to other platforms (non-Microsoft Windows) or Web servers
  • When importing data from or exporting data to other encodings

The Unicode standard defines multiple encodings of its single character set: UTF-7, UTF-8, UTF-16, and UTF-32. This section describes these common encodings:

  • UCS-2
  • UTF-16
  • UTF-8

Generally, SQL Server stores Unicode in the UCS-2 encoding scheme. However, many clients process Unicode in another encoding scheme, such as UTF-8. This scenario frequently occurs in Web-based applications. In Microsoft Visual Basic applications, character strings are processed in the UCS-2 encoding scheme. Therefore, you do not have to explicitly specify encoding scheme conversions between Visual Basic applications and an instance of SQL Server.

SQL Server 2005 encodes XML data using Unicode (UTF-16). Data in a column of type xml is stored in an internal format as large binary objects (BLOBs) in order to support XML model characteristics, such as document order and recursive structures. Therefore, XML data retrieved from the server comes out in UTF-16; if you want a different encoding for data retrieval, your application must perform the necessary conversion on the retrieved UTF-16 data. XML Best Practices in SQL Server 2005 Books Online provides an example of how to explicitly declare an encoding for XML data retrieved from a varchar(max) column.

The UTF-16 encoding is used because it can handle 2-byte or 4-byte characters, and is processed according to a byte-oriented protocol. These qualities make UTF-16 well-suited for traversing different computers that use different encodings and byte-ordering systems. Because XML data is typically shared widely across networks, it makes sense to maintain the default UTF-16 storage of XML data in your database, and when you export XML data to clients.

UCS-2

UCS-2 is a predecessor of UTF-16. UCS-2 differs from UTF-16 in that UCS-2 is a fixed-length encoding that represents all characters as a 16-bit value (2 bytes), and therefore does not support supplementary characters. UCS-2 is frequently confused with UTF-16, which is used to internally represent text in the Microsoft Windows operating systems (Windows NT, Windows 2000, Windows XP, and Windows CE), but UCS-2 is more limited.

Note   For up-to-date information about the use of Unicode in the Windows operating system, see Unicode in the Microsoft Developer Network (MSDN) Library. It is recommended that a Windows application use UTF-16 internally and convert as part of a "thin layer" over the interface only if another format must be used.

Information that is stored in Unicode in Microsoft SQL Server 2000 and Microsoft SQL Server 2005 uses the UCS-2 encoding, which stores every character as two bytes, regardless of which character is used. Therefore, the Latin letter "A" is treated the same way as the Cyrillic letter Sha (Ш), the Hebrew letter Lamed (ל), the Tamil letter Rra (ற), or the Japanese Hiragana letter E (え). Each one has a unique code point (for these letters, the code points are U+0041, U+0428, U+05DC, U+0BB1, and U+3048, respectively, where each four-digit hexadecimal number represents the two bytes that UCS-2 uses).
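
A quick way to inspect code points and storage in Transact-SQL is with the UNICODE, NCHAR, and DATALENGTH functions:

SELECT UNICODE(N'A') AS CodePoint,    -- returns 65 (U+0041)
       DATALENGTH(N'A') AS Bytes;     -- returns 2; UCS-2 always uses two bytes

SELECT NCHAR(12360);                  -- returns the Hiragana letter E (U+3048)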

Because UCS-2 allows for encoding of only 65,536 different code points, it does not natively handle supplementary characters, and instead treats supplementary characters as a pair of undefined Unicode surrogate characters that, when paired together, define a supplementary character. However, SQL Server can store supplementary characters without risk of loss or corruption. You can extend the capabilities of SQL Server to work with surrogate pairs by creating custom CLR functions. For more information about working with surrogate pairs and supplementary characters, see the "Supplementary Characters and Surrogate Pairs" section, later in this paper.

Note   A supplementary character is defined as a Unicode-encoded character that has a supplementary code point. Supplementary code points are in the range from U+10000 to U+10FFFF.

UTF-8

UTF-8 is an encoding scheme that is designed to treat Unicode data in a way that is independent of the byte ordering on the computer. UTF-8 is useful for working with ASCII and other byte-oriented systems that require 8-bit encodings, such as mail servers that must span a vast array of computers that use different encodings, different byte orders, and different languages. Although SQL Server 2005 does not store data in UTF-8 format, it supports UTF-8 for handling Extensible Markup Language (XML) data. For more information, see the XML Support in SQL Server 2005 section of this paper.

Other database systems, such as Oracle and Sybase SQL Server, support Unicode by using UTF-8 storage. Depending on a server's implementation, this can be technically easier for a database engine to implement, because existing text management code on the server does not require major changes to deal with data one byte at a time. However, in the Windows environment, UTF-8 storage has several disadvantages:

  • The Component Object Model (COM) supports only UTF-16/UCS-2 in its APIs and interfaces. Therefore, if data is stored in UTF-8 format, constant conversion is required. This issue applies only when COM is used; the SQL Server database engine does not typically call COM interfaces.
  • The Windows XP and Windows Server 2003 kernels are both Unicode. UTF-16 is the standard encoding for Windows 2000, Windows XP, and Windows Server 2003. However, Windows 2000, Windows XP, and Windows Server 2003 are UTF-8 aware. Therefore, use of a UTF-8 storage format in the database requires many extra conversions. Typically, the extra resources needed for conversion do not affect the SQL Server database engine, but could potentially affect many client-side operations.
  • UTF-8 can be slower for many string operations. Sorting, comparing, and virtually any string operation can be slowed because characters do not have a fixed width.
  • UTF-8 often needs more than 2 bytes, and the increased size can make for a larger footprint on disk and in memory.

Despite these drawbacks, given the fact that XML has become an important standard for communication over the Internet, you may want to consider defaulting to UTF-8.

Note   Older versions of Oracle and Java also use UCS-2, and cannot recognize surrogates. Oracle Corporation started supporting Unicode as a database character set in Oracle 7. Oracle currently supports two methods for Unicode data storage: (1) UTF-8 as the encoding for CHAR and VARCHAR2 character data types, and for all SQL names and literals; (2) UTF-16 for storage of the NCHAR, NVARCHAR, and NCLOB Unicode data types. Oracle allows you to use both methods simultaneously.

UTF-16

UTF-16 is the standard encoding at Microsoft and in the Windows operating system. UTF-16 is an extension of UCS-2 that was devised to handle an additional 1,048,576 characters. The need for a surrogate range was recognized even before Unicode 2.0 was released, as it became clear that the Unicode goal of having a single code point for every character in every language could not be achieved by using only 65,536 characters. For example, some languages, such as Chinese, require at least that many characters to encode characters such as historical and literary ideographs, which, although rarely used, are nonetheless important for publishing and scholarship. The next section provides more information about the surrogate range.

Like UCS-2, UTF-16 uses a little endian byte order, as does everything on Windows. Little endian, as opposed to big endian, means that the low-order byte is stored at the lowest address in memory. The ordering of bytes is important at the operating-system level. SQL Server, together with other applications that run on the Windows platform, uses the little endian byte order. Therefore, a hexadecimal word such as 0x1234 is stored in memory as 0x34 0x12. In certain cases you may need to explicitly reverse the byte order to correctly read the encoding for a character. SQL Server Integration Services provides functions for converting the byte order of Unicode text. For more information, see the SQL Server 2005 Integration Services section of this paper.
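
You can observe this byte order directly by casting Unicode text to binary, as in this small sketch:

SELECT CAST(N'A' AS varbinary(2));    -- returns 0x4100: code point U+0041, low byte first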

Supplementary Characters and Surrogate Pairs

Microsoft Windows normally uses UTF-16 to represent character data. The use of 16 bits allows representation of 65,536 unique characters. However, even this is not enough to cover all of the symbols used in human languages. In UTF-16, a character outside this range is encoded as two consecutive code units: the first falls in the surrogate range, and the second completes the character.

In the Unicode standard, there are 16 planes of characters, with the potential to define as many as 1,114,112 characters. Plane 0, or the Basic Multilingual Plane (BMP), can represent most of the world's written scripts, characters used in publishing, mathematical and technical symbols, geometric shapes, all level-100 Zapf Dingbats, and punctuation marks.

Outside of the BMP, the characters in most planes are still undefined, but can be used to represent supplementary characters. Supplementary characters are used primarily for historical and classical literary documents to help with the encoding of the rich literary heritage of the Chinese, Korean, and Japanese languages. Supplementary characters also include runes and other historic scripts, musical symbols, and so forth.

In UTF-16, a pair of code points, called a surrogate pair, is used to represent characters outside the main character set (the BMP). The surrogate area is a range in Unicode from U+D800 to U+DFFF that contains 1,024 low surrogate values and 1,024 high surrogate values. A high surrogate (the range U+D800 to U+DBFF) and a low surrogate (the range U+DC00 to U+DFFF) are combined to give access to over a million possible characters.

It is not considered valid to have only one-half of a surrogate pair; to be valid, there must always be a high surrogate followed by a low surrogate. This makes checking for a surrogate a matter of range checking, which is easy compared to the rather complex rules that are required to detect DBCS (double-byte character system) characters.

Both SQL Server 2000 and SQL Server 2005 can store surrogate pairs, even though UCS-2 is not aware of surrogates. SQL Server treats the surrogate pairs as two undefined Unicode characters rather than as a single character. Applications that behave this way are usually referred to as surrogate-neutral or surrogate-safe, meaning that there is no intrinsic ability to interact with the data, but at least the data can be stored without loss.

In contrast, "surrogate-aware" applications can not only take surrogate pairs into consideration, but can also process combining characters and other characters that require special handling. A well-written application can detect separated surrogates, and recombine them, with just a few subroutines. Surrogate-aware applications include Microsoft Word and Internet Explorer 5 and later.

When working with supplementary characters in SQL Server, remember the following points:

  • Because surrogate pairs are considered to be two separate Unicode code points, the size of nvarchar(n) needs to be 2 to hold a single supplementary character (in other words, space for a surrogate pair).
  • Supplementary characters are not supported for use in metadata, such as in names of database objects. In general, text used in metadata must meet the rules for identifiers. For more information, see Identifiers in SQL Server 2005 Books Online.
  • Standard string operations are not aware of supplementary characters. Operations such as SUBSTRING(nvarchar(2),1,1) return only the high surrogate of the supplementary character's surrogate pair. The LEN function returns a count of two characters for every supplementary character encountered: one for the high surrogate and one for the low surrogate, as the sketch following this list demonstrates. However, you can create custom functions that are aware of supplementary characters. The StringManipulate sample in Supplementary-Aware String Manipulation, in SQL Server 2005 Books Online, demonstrates how to create such functions.
  • Sorting and searching behavior for supplementary characters may change depending on the collation. In the new 90 and BIN2 collations, supplementary characters are correctly compared, whereas in older collations and standard Windows collations, all supplementary characters compare equal to one another. For example, the default Japanese and Korean collations do not handle supplementary characters, whereas Japanese_90 and Korean_90 do.
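
The following sketch builds one supplementary character (U+20000, a CJK ideograph chosen only for illustration) from its surrogate pair, and shows how the standard string functions see two code points rather than one character:

DECLARE @supp nvarchar(2);
SET @supp = NCHAR(55360) + NCHAR(56320);   -- high surrogate U+D840 + low surrogate U+DC00 = U+20000

SELECT LEN(@supp) AS CharCount,            -- returns 2: one per surrogate half
       DATALENGTH(@supp) AS ByteCount,     -- returns 4
       SUBSTRING(@supp, 1, 1) AS Fragment; -- returns only the high surrogate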

For more information about Unicode code points, best practices for designing surrogate-aware applications, and working with Unicode data, see Globalization Step-by-Step: Unicode Enabled. For information about character ranges supported in the Unicode standard, see the section on Unicode Regular Expressions in Unicode Technical Standard #18.

Combining Characters

Combining characters are characters that are used together with other characters to modify their appearance or meaning. The combined characters form a single glyph. For example, diacritics used in European languages are combining characters that can either appear as a character plus diacritic, or as a precomposed character.

In the .NET Framework, the sequence of combining characters is treated as a text element—that is, a unit of text that is displayed as a single character. A text element is different from a sort element. For example, in some collations, the letters "CH" are not combining characters; they are two separate text elements, but can be treated as one sort element.

Note   SQL functions, on the other hand, generally treat combining characters the same as supplementary characters: They process them as two separate Unicode code points. For an example of how to create a custom function that more accurately counts and compares supplementary characters, see the StringManipulate sample.

The way that combining characters map to sort elements depends on both the Unicode standard and the collation. Some combined characters are always regarded as equivalent to their variant forms, no matter how many different code points they include (for example, Latin letter a plus a diacritic is treated as equivalent to the precomposed letter including diacritic), whereas in certain collations it is possible to sort or compare strings differently depending on the presence of the diacritic.
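
The following sketch contrasts the precomposed letter é (U+00E9) with its decomposed form, the letter e followed by the combining acute accent (U+0301). Under a Windows collation such as Latin1_General_CI_AS, the two forms should compare as equal even though their lengths differ:

DECLARE @precomposed nvarchar(5), @decomposed nvarchar(5);
SET @precomposed = NCHAR(233);          -- é as a single code point (U+00E9)
SET @decomposed  = N'e' + NCHAR(769);   -- e plus combining acute accent (U+0301)

SELECT LEN(@precomposed) AS PrecomposedLen,  -- returns 1
       LEN(@decomposed)  AS DecomposedLen,   -- returns 2
       CASE WHEN @precomposed = @decomposed COLLATE Latin1_General_CI_AS
            THEN 'equal' ELSE 'not equal' END AS Comparison;  -- 'equal'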

Combining characters were originally defined in Unicode 2.0. For more information, see the section of the Unicode 4.0.1 specification that deals with Special Areas and Format Characters. The Unicode Consortium also publishes a FAQ specifically related to combining characters and their processing. For more information about methods for processing combining characters in the .NET Framework, see Normalization in the .NET Framework Developer's Guide.

Support for GB18030

GB18030 (GB18030-2000) is a separate standard mandated by the People's Republic of China (PRC) for encoding Chinese characters. It specifies both an extended code page and a mapping table to Unicode. As of August 1, 2006, support for this character set is officially required for all software products sold in the PRC. GB18030 conformance includes requirements to support some previously unsupported languages—for example, Tibetan, Mongolian, Yi, and Uyghur.

In GB18030, characters can be 1, 2, or 4 bytes. Surrogate pairs are used to enable mapping of the GB18030 4-byte sequences to Unicode.

SQL Server 2005 provides support of GB18030-encoded characters by recognizing them when they enter the server from a client-side application. SQL Server 2005 converts and stores these characters natively as Unicode. After they are stored in the server, they are treated as Unicode characters in any subsequent operations performed on them. GB18030 does not have a system locale; it has only a code page identifier, to allow for conversions to and from Unicode. The Microsoft code page identifier for GB18030-2000 is 54936.

When you use GB18030 characters, remember that they can be used in ordering and comparison operations, but in collations older than the 90 versions, comparisons are based only on their code points, not on linguistically meaningful criteria. Therefore, be careful when you use GB18030 characters in operations such as ORDER BY, GROUP BY, and DISTINCT, especially when GB18030 and non-GB18030 characters are included in the same operation. To enable meaningful string comparisons that use GB18030 characters, use the new SQL Server 90 collation version, signified by the 90 suffix added to its name. For example, instead of the Chinese_PRC collation, use Chinese_PRC_90, as in the sketch that follows.
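
A minimal sketch of this recommendation, using a hypothetical table:

CREATE TABLE dbo.ChineseText
(
    Id   int IDENTITY(1,1) PRIMARY KEY,
    Body nvarchar(400) COLLATE Chinese_PRC_90_CI_AS   -- 90 version instead of Chinese_PRC
);

SELECT Body
FROM dbo.ChineseText
ORDER BY Body;   -- sorted by the 90 collation's linguistic rules, not raw code points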

To enable support of the GB18030 standard, you can install a support package, available from the Microsoft Product Help and Support portal, that includes a font file and libraries to support conversion between GB18030 and Unicode. The support package is a single, worldwide binary that works on Windows XP or Windows 2000. However, the system must have the optional East Asian language support installed. In Windows Vista, support for the GB18030 standard is built in, including fonts and keyboard layouts for Chinese minority languages such as Tibetan, Mongolian, Yi, and Uyghur. These languages use the Chinese (PRC) locale.

Note   Some functions for converting GB18030 bytes to Unicode characters, such as BytesToUnicode, are not supported in Windows Vista. When converting GB18030 bytes to Unicode characters in Windows Vista, use the MultiByteToWideChar function.

For more information about support for this standard in Microsoft products, see the Microsoft Global Development and Computing portal.

Data Types in SQL Server 2005

This section describes the new data types in SQL Server 2005, and explains issues related to the use of SQL Server 2005 data types for storing international data.

Non-Unicode Text Types: char, varchar, text, varchar(max)

When you deal with text data that is stored in the char, varchar, varchar(max), or text data type, the most important limitation to consider is that only information from a single code page can be validated by the system. (You can store data from multiple code pages, but this is not recommended.) The exact code page used to validate and store the data depends on the collation of the column. If a column-level collation has not been defined, the collation of the database is used. To determine the code page that is used for a given column, you can use the COLLATIONPROPERTY function, as shown in the following code examples:

SELECT COLLATIONPROPERTY('Chinese_PRC_Stroke_CI_AI_KS_WS', 'CodePage')
936

SELECT COLLATIONPROPERTY('Latin1_General_CI_AI', 'CodePage')
1252

SELECT COLLATIONPROPERTY('Hindi_CI_AI_WS', 'CodePage')
0

The last example returns 0 (Unicode) as the code page for Hindi. This example illustrates the fact that many locales, such as Georgian and Hindi, do not have code pages, as they are Unicode-only collations. Those collations are not appropriate for columns that use the char, varchar, or text data type, and some collations have been deprecated. For a list of available collations and which collations are Unicode-only, see Collation Settings in Setup in SQL Server 2005 Books Online.

Important   In SQL Server 2005, use the varchar(max) data type instead of the text data type. The text data type will be removed in a future version of Microsoft SQL Server. For more information, see Using Large-Value Data Types in SQL Server 2005 Books Online.

When Unicode data must be inserted into non-Unicode columns, the columns are internally converted from Unicode by using the WideCharToMultiByte API and the code page associated with the collation. If a character cannot be represented on the given code page, the character is replaced by a question mark (?). Therefore, the appearance of random question marks within your data is a good indication that your data has been corrupted by an unintended code-page conversion. It is also a good indication that your application could benefit from conversion to a Unicode data type.
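
The following sketch reproduces this conversion loss with a hypothetical table variable; Greek text inserted into a varchar column under a Latin1 collation comes back as question marks:

DECLARE @t table (Comment varchar(20) COLLATE Latin1_General_CI_AS);
INSERT INTO @t VALUES (N'Ελληνικά');   -- Greek text into a code page 1252 column
SELECT Comment FROM @t;                -- returns '????????'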

If you use a string literal of a non-Unicode type that is not supported by the collation, the string is converted first using the database's default code page, which is derived from the default collation of the database.

Another problem you might encounter is the inability to store data when not all of the characters you wish to support are contained in the code page. In many cases, Windows considers a particular code page to be a "best fit" code page, which means there is no guarantee that you can rely on the code page to handle all text; it is merely the best one available. An example of this is the Arabic script: it supports a wide array of languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Pashto, Sindhi, Uighur, Urdu, and more. All of these languages have additional characters beyond those in the Arabic language as defined in Windows code page 1256. If you attempt to store these extra characters in a non-Unicode column that has the Arabic collation, the characters are converted into question marks.

Unicode Text Types: nchar, nvarchar, nvarchar(max), ntext

The SQL-92 specification defines the data types prefaced with "N" (meaning "national" data types), but does not specifically require that the data types be used for Unicode; the actual definition of these data types is left to the database platform or developer. In SQL Server 2000 and SQL Server 2005, these data types are defined as being equivalent to UCS-2, which is a Unicode encoding. However, when you work with other database servers, it is important to know that the "N" data types do not specifically mean Unicode. The decision to define the "N" data types as Unicode is specific to Microsoft SQL Server.

The nvarchar(max) data type, which is new in SQL Server 2005, holds up to 2 gigabytes (GB) of data and is the preferred alternative to the ntext data type.

Important   In SQL Server 2005, use the nvarchar(max) data type instead of the ntext data type. The ntext data type will be removed in a future version of Microsoft SQL Server. For more information, see Using Large-Value Data Types in SQL Server 2005 Books Online.

For the storage of complex scripts, such as Hindi and Tamil, make sure that the data is in the proper ordering. Many languages, such as Tamil, specify that certain letters be reordered when the text is rendered; therefore, the logical order of text as it is stored in memory can be different from the visual order that will be seen in a user interface. Data should always be stored in the proper logical order for any complex script language, which includes all of the Indic languages, Arabic, Farsi, Hebrew, and many others. The actual rendering of such data is a separate issue (see the Multilingual Data in the User Interface section of this paper).

Although the "N" columns can support data of any language or combination of languages, when you sort the data, you can only choose a single collation. To learn more about how to choose a collation, see the Collations section of this paper. None of the code page limitations that were mentioned previously in this paper apply to Unicode columns.

In SQL Server 2005, you can create additional functions to improve string manipulation and collation behavior with supplementary characters. For example, the StringManipulate sample for Microsoft SQL Server 2005 demonstrates supplementary character-aware string processing. This sample shows how to implement five Transact-SQL string functions that provide the same string manipulation functions as built-in string functions, but with additional supplementary character-aware capability to handle both Unicode and supplementary character strings.

CLR Data Types

Microsoft SQL Server gives you the ability to extend the SQL type system by defining a custom data type for use in SQL Server programming. These user-defined types are suited to the creation of custom date, time, currency, and extended numeric types, or for encoded or encrypted data.

A user-defined type (UDT) can be used to define the type of a column in a table, or a variable or routine parameter in the Transact-SQL language. An instance of a user-defined type can be a column in a table, a variable in a batch, a function or stored procedure, or an argument of a function or stored procedure. A user-defined type is implemented as a managed class in any one of the CLR languages, and then registered with SQL Server. For information about how to implement a user-defined type by using Visual Basic or Microsoft Visual C#, see Coding User-Defined Types in SQL Server 2005 Books Online.

You can use user-defined types to extend the scalar type system of the server, enabling storage of CLR objects in a SQL Server database. UDTs can contain multiple elements and can have special behaviors that you define. This differentiates them from the traditional alias data type, which consists of a single SQL Server system data type. For example, the Currency UDT sample that is provided in SQL Server 2005 Books Online supports handling amounts of money in the monetary system of a particular culture. You must define two fields: a string value for CultureInfo, which specifies the source of the currency (for example, en-us), and a decimal value for CurrencyValue, to represent the amount of money.
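
After a UDT is compiled into a CLR assembly, registering it is a two-step process in Transact-SQL. The assembly path and names in this sketch are hypothetical; the pattern follows the Currency sample:

CREATE ASSEMBLY CurrencySample
FROM 'C:\Samples\Currency.dll'             -- hypothetical path to the compiled assembly
WITH PERMISSION_SET = SAFE;

CREATE TYPE dbo.Currency
EXTERNAL NAME CurrencySample.[Currency];   -- assembly name, then the CLR class name

After registration, dbo.Currency can be used like any built-in type, for example as the type of a table column.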

Because UDTs are accessed by the system as a whole, their use for complex data types may negatively affect performance. Complex data is generally best modeled using traditional rows and tables. SQL Server 2005 Books Online includes several samples that demonstrate how to create and work with custom user-defined types. The UTF8String sample for SQL Server 2005 demonstrates how to implement a user-defined data type that extends the type system of the database to provide storage for UTF8-encoded values. The new type also implements code to convert Unicode strings to and from UTF8. For details, see UTF8 String User-Defined Data Type (UDT) in SQL Server 2005 Books Online.

For additional examples, see CLR Programmability Samples in SQL Server 2005 Books Online.

xml Data Type

The xml data type allows you to store an XML fragment or document in SQL Server databases. Instances of the xml data type can be columns in a table, functions, or stored procedure arguments, or variables in a function or stored procedure. In addition, the xml data type can be specialized by indicating an associated XML schema that provides both validation constraints and type information for the data for the XML instance.

You perform operations on an instance of an xml data type by using built-in XML query methods. These methods accept queries and data-manipulation statements that are appropriate for XML data. You can then specify queries (XQuery) against the XML that is stored in the xml data-type variable or column, and apply updates (using insert, update, or delete) to the XML instance. You can also create an index on the XML column, which will improve query performance.
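
A brief sketch of these methods, using an illustrative document:

DECLARE @doc xml;
SET @doc = N'<order id="1"><item sku="A100" qty="2"/></order>';

SELECT @doc.query('/order/item') AS ItemElement,          -- XQuery: returns the <item> element
       @doc.value('(/order/@id)[1]', 'int') AS OrderId;   -- extracts a typed scalar value (1)

SET @doc.modify('insert <item sku="B200" qty="1"/> as last into (/order)[1]');  -- XML DML update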

For more information about the xml data type, and the features in SQL Server 2005 to support handling of XML data, see the XML Support in SQL Server 2005 section of this paper.

Date/Time Types: datetime, smalldatetime

The date and time data types used in SQL Server 2000 and SQL Server 2005 have the following definitions:

datetime

Date and time data in the Gregorian calendar from January 1, 1753, through December 31, 9999, to an accuracy of one three-hundredth of a second (equivalent to 3.33 milliseconds or 0.00333 seconds).

smalldatetime

Date and time data in the Gregorian calendar from January 1, 1900, through June 6, 2079, with accuracy to the minute. smalldatetime values with 29.998 seconds or lower are rounded down to the nearest minute; values with 29.999 seconds or higher are rounded up to the nearest minute.

Microsoft SQL Server rejects data that falls outside of these ranges. The actual data is stored internally as two integers that represent the date and time in question (4-byte integers for datetime, and 2-byte integers for smalldatetime).

The data stored does not represent either a local time or a universal time, and contains no time zone information. If you need to convert dates to universal time, you can use one of the UTC date functions. The current UTC time is derived from the current local time and the time zone setting in the operating system of the computer on which the instance of Microsoft SQL Server is running. Because the value has no intrinsic locale-specific formatting, it is up to the developer to define conversions as needed. SQL Server supports many different locale-specific conversions that can be performed at the server instead of relying on custom solutions from developers. These date styles can be accessed through the CONVERT function, which takes a data type, an expression, and an optional style, as shown in the following table.

With century | Without century | Standard | Format (input when converting to datetime; output when converting to character data)
0 or 100     | -  | Default                        | mon dd yyyy hh:miAM (or PM)
101          | 1  | U.S.                           | mm/dd/yy
102          | 2  | ANSI                           | yy.mm.dd
103          | 3  | British/French                 | dd/mm/yy
104          | 4  | German                         | dd.mm.yy
105          | 5  | Italian                        | dd-mm-yy
106          | 6  | -                              | dd mon yy
107          | 7  | -                              | Mon dd, yy
108          | 8  | -                              | hh:mi:ss
9 or 109     | -  | Default + milliseconds         | mon dd yyyy hh:mi:ss:mmmAM (or PM)
110          | 10 | U.S.                           | mm-dd-yy
111          | 11 | Japan                          | yy/mm/dd
112          | 12 | ISO                            | yymmdd
13 or 113    | -  | Europe default + milliseconds  | dd mon yyyy hh:mi:ss:mmm(24h)
114          | 14 | -                              | hh:mi:ss:mmm(24h)
20 or 120    | -  | ODBC canonical                 | yyyy-mm-dd hh:mi:ss(24h)
21 or 121    | -  | ODBC canonical + milliseconds  | yyyy-mm-dd hh:mi:ss.mmm(24h)
126          | -  | ISO8601                        | yyyy-mm-ddThh:mi:ss.mmm (no spaces)
130          | -  | Kuwaiti (Hijri)                | dd mon yyyy hh:mi:ss:mmmAM
131          | -  | Kuwaiti (Hijri)                | dd/mm/yy hh:mi:ss:mmmAM

The following example shows how the CONVERT function is used to represent the current date in a specified style:

SELECT CONVERT(char, GETDATE(), 100) AS [100]
RETURNS:
Aug 16 2000 11:50AM

You can then convert the data from a string to a date value in much the same manner:

SELECT CONVERT(datetime, 'Aug 16 2000 11:50AM', 100) AS [100]
RETURNS:
100
2000-08-16 11:50:00.000

If you convert dates in Style 130 (Kuwaiti or Hijri) to a char data type, the data may be corrupted if the collation is not one of the Arabic collations that use code page 1256 for Unicode conversions. For example, Figure 1 shows a column that has been converted to char and a second column that has been converted to nchar. In this example, the client computer uses the EN-US locale. Therefore, when you attempt to use the char data type, the Arabic characters are converted into question marks, whereas if you use the nchar data type, Arabic characters are displayed.

Figure 1. Using CONVERT with date/time data

However, even the string represented using nchar is still not correctly formatted, as it would be on an Arabic client computer, because of limitations in the Query Editor. Figure 2 shows how the actual Hijri date string should appear.


Figure 2. Hijri date string

The reason that Arabic cannot be rendered correctly is that complex scripts, such as Arabic, have shaping rules that control how the data is rendered. In the case of bidirectional (BIDI) languages such as Hebrew, shaping causes all data to be reversed. In the case of Arabic, shaping has a more marked effect, because the actual shapes of letters can change depending on the surrounding letters. This problem does not happen in versions of Windows after Windows 2000, or in any earlier, Arabic-enabled versions of Windows.

Additionally, the date string that is returned can cause problems in the bidirectional cases where it is needed, because the rules for the layout of bidirectional text used by an application, such as Internet Explorer, cause the date to appear as shown in Figure 3.


Figure 3. Bidirectional date string example

This visual order (dd hh:mi:ss yyyy mon :) is obviously not the expected order; the problem is a general limitation of the 130 style in the CONVERT function. You can work around this problem by adding the proper Unicode control character in front of the string, as in the following query:

SELECT NCHAR(8207) + CONVERT(nchar, GETDATE(), 130)

The NCHAR function returns a character based upon the passed-in Unicode code point; 8207 or hexadecimal 0x200F is the Right-to-Left Marker (RLM), and it causes the string to be displayed correctly.

Performance and Storage Space

If you define a column using one of the Unicode data types, you may experience the following issues relating to storage space and speed.

Increased Storage Space

Unicode data types use two or more bytes per character, whereas non-Unicode data types use one byte for all non-DBCS text and two bytes for Asian languages that use DBCS. Therefore, unless your data uses one of the Asian code pages, when you convert to Unicode you will use twice as much space to store the data. The increased storage space requirements must be considered when you upgrade existing databases, or when you are deciding on the proper data types for a new project. If you are storing data only in a column that is on a single (non-Asian) code page, you might prefer to not use Unicode so you can save space on disk and in memory. However, the advantages of Unicode conversion generally outweigh the storage impacts.
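
You can verify the doubled storage with the DATALENGTH function, for example:

SELECT DATALENGTH(CAST('database' AS varchar(20)))  AS VarcharBytes,   -- returns 8
       DATALENGTH(CAST('database' AS nvarchar(20))) AS NvarcharBytes;  -- returns 16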

Note   When storing Asian DBCS data, the UCS-2 encoding method used by SQL Server 2005 tends to be more efficient than the UTF-8 method used by many other database programs. This is because UTF-8 uses three bytes to store most Asian language characters, while UCS-2 uses just two (with the exception of supplementary characters and combining characters). On the other hand, for non-DBCS languages, such as ASCII-based characters, UTF-8 usually uses only one byte per character, while UCS-2 uses two.

Speed Issues

The effect of using Unicode data types on performance is complex. You must consider the following issues:

  • If you are running on Windows XP, Windows Server 2003, or Windows Vista, the operating system expects Unicode data; therefore, in many cases non-Unicode columns have to be converted when you display data or use the operating system services.
  • When you are dealing with a native DBCS data format, additional time may be needed to load the larger amount of data.
  • If you are working between instances, between database server products, or exchanging data with other applications, the number of conversions can affect performance.
  • If you are dealing with Asian languages, Unicode is faster than the language-specific DBCS code page. This is because DBCS data does not have fixed width; it is a mixture of double-byte and single-byte characters.
  • If you are dealing with non-Asian languages, sorting Unicode data can be slower than sorting non-Unicode data.
  • Binary is the fastest sorting order, and it is case sensitive, but it can yield unexpected sort orders. If a binary sort order is selected, the case-sensitive, accent-sensitive, kana-sensitive, and width-sensitive options are not available.

Important To evaluate performance realistically, you must test in both the Unicode and non-Unicode scenarios.

Migration of Metadata Information in System Tables

In SQL Server 2005, the system tables store all data, including identifiers for objects, as Unicode. This minimizes the problems that can occur when different collations are used between databases and columns.

If you are migrating from SQL Server 2000 to SQL Server 2005, the only significant change is that SQL Server 2000 uses the Unicode 2.0 standard for the list of valid characters in identifiers, whereas SQL Server 2005 supports the Unicode Standard, Version 3.2.

You can directly upgrade instances of SQL Server 2000 Service Pack 3 (SP3) or later, and instances of SQL Server 7.0 SP4 or later, to SQL Server 2005. You can perform most upgrade operations through Setup; however, some components support or require migration of applications or solutions after running Setup.

Collations

Sort order is a critical but often overlooked part of database definition. Users tend to take sorting in their own alphabet for granted. However, some languages, such as Greek, Russian, and Thai, use different alphabets. Some languages, such as Japanese, use multiple alphabets, with complex rules for ordering. Even among the European languages, there are wide differences in the way that individual characters are handled. For example, users of traditional Spanish expect the letter combination "ch" to be treated as a single letter that sorts after all other combinations beginning with "c."

Basic sorting works through collations, which control sort orders, among other behaviors. Sorting can be optimized through the creation of indexes.

Sorting works together with a technique known as string normalization. This kind of normalization is different than the sense of the word that database developers are used to. String normalization refers to a method of comparing two strings so that they can be sorted: for example, you can specify whether different kana types should be treated as equivalent, or whether accents should be ignored.

For non-Unicode columns, collation has a second meaning that is very important: the collation specifies the code page for the data and therefore determines which characters can be represented. Data can be moved between Unicode columns without requiring special conversion or character mappings; data moved between non-Unicode columns is frequently garbled or corrupted, or at the very least cannot be displayed.

Collations in SQL Server 6.5 and Earlier

In SQL Server versions 6.5 and earlier, collations were used to specify the code page to use for language in general. There were numerous limitations related to different sort orders. For example, you could only support Western European languages if you used Latin-1. The collations also limited the number of different locales that you could represent in a single instance of SQL Server. In other words, you could only store or display the language used in a specific region. To use a different language you needed to set up a separate database or even a separate server.

These issues apply also to the collation of non-Unicode fields in later versions of SQL Server.

Collations in SQL Server 7.0

SQL Server 7.0 provided for one Unicode and one non-Unicode collation per server. The non-Unicode collations were defined as a combination of a code page and a sort order. Often, each code page can support more than one sort order; for example, Latin languages usually allow both case-sensitive and case-insensitive sorting. Simplified Chinese allows sorting both by stroke count and phonetic sorts.

In a Unicode collation, any character of any language can be included in the column, and the various collations that are provided were designed to make sure that any collation-specific differences are properly handled. In other words, you choose the "best fit" to give users the data they expect. For example, the general Unicode collation lets you sort Farsi data.

A Unicode collation consists of a locale and several comparison styles. Locales are usually named after countries or cultural areas. They sort characters according to the standard in that area. The Unicode collation still provides a sort order for all characters in the Unicode standard, but precedence is given to the locale specified. Any locale that does not have a supported, unique Unicode collation should use the General Unicode Collation.

The general Unicode sort order correctly handles not only the storage but also the sorting of data in Afrikaans, Albanian, Arabic, Basque, Belarusian, Bulgarian, English, Faeroese, Farsi, Georgian (Traditional), Greek, Hebrew, Hindi, Indonesian, Malay, Russian, Serbian, Swahili, and Urdu.

One important change in SQL Server 7.0 was the provision of an operating system–independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 would be consistent. This string comparison code was based on the same code that Windows 2000 uses for its own string normalization, and is encapsulated to be the same on all computers and in all versions of SQL Server.

Collations in SQL Server 2000

In SQL Server 2000, the collation model was changed to eliminate inconsistencies between the two different systems of collations: Windows and SQL Server collations. A more flexible model was needed to handle all of the new requirements for collations. New collations were also required in SQL Server 2000 to handle the code page of non-Unicode columns.

To address these needs, a single, consistent model was designed to handle both Unicode and non-Unicode sorts. Each of the collations was combined with suffixes that help define whether the collation is sensitive to case, accent, width, or kana type. Appendix A contains a table that lists the suffixes. These 17 suffixes, combined as appropriate with the 40 languages supported in SQL Server 2000, created a total of 680 Windows collations.

The language names used for the collations were arbitrary and chosen to represent each unique supported code page for non-Unicode data and a sort order for all data. In many cases, multiple languages could be completely represented on a single code page, or processed by the same sort order used by another language, and in these cases the additional languages were removed from the list.

Collations in SQL Server 2005

SQL Server 2005 supports all languages supported by the Microsoft Windows operating systems. This means that SQL Server 2005 added support for new and updated collations in Windows Server 2003 and Windows XP. (These collations are installed as part of SQL Server 2005 setup. You control the choice of collation for the server and database during setup. Updates to the operating system do not affect collations used in SQL Server.)

An important part of the new collations was the set of East Asian collations that support supplementary characters. Support was also added for string comparison of supplementary characters based on code points, and a binary flag (BIN2) was introduced to enable true code-point comparisons.

Binary collations sort and compare data in SQL Server based on the bit pattern for each character. Each binary collation in SQL Server maps to a specific language locale and ANSI code page, and each performs case-sensitive and accent-sensitive data sorts. Binary collations provide the fastest data sorts.

The Binary option (_BIN) sorts and compares data in SQL Server tables based on the bit patterns defined for each character. Binary sort order is case sensitive and accent sensitive. Binary is also the fastest sorting order. If this option is not selected, SQL Server follows sorting and comparison rules as defined in dictionaries for the associated language or alphabet.

The Binary-code point (_BIN2) option sorts and compares data in SQL Server tables based on Unicode code points for Unicode data. For non-Unicode data, Binary-code point will use comparisons identical to binary sorts. The advantage of using a Binary-code point sort order is that no data resorting is required in applications that compare sorted SQL Server data. As a result, a Binary-code point sort order provides simpler application development and possible performance increases.

Appendix F contains a list of the collations that were updated in SQL Server 2005. Unless you need backward compatibility with SQL Server 2000 or earlier, it is best to use the updated collations.

Collations in SQL Server 2005 include two collation types: Windows collations and SQL collations.

Windows Collations

Windows collations define rules for storing character data based on an associated Windows locale. (There are more Windows locales than there are SQL Server Windows collations.) The base Windows collation rules specify which alphabet or language is used when dictionary sorting is applied, as well as the code page used to store non-Unicode character data. In SQL Server, Windows collations are combined with a series of suffixes to additionally define sorting and comparison rules based on case, accent, kana, and width sensitivity. The complete Windows collation name is composed of the collation designator and the comparison styles.
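
The effect of the comparison-style suffixes is easy to see in a query. This sketch uses the Latin1_General designator with different suffixes:

SELECT CASE WHEN 'RESUME' = 'resume' COLLATE Latin1_General_CI_AS
            THEN 'equal' ELSE 'not equal' END AS CaseInsensitive,    -- 'equal'
       CASE WHEN 'RESUME' = 'resume' COLLATE Latin1_General_CS_AS
            THEN 'equal' ELSE 'not equal' END AS CaseSensitive,      -- 'not equal'
       CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_CI_AI
            THEN 'equal' ELSE 'not equal' END AS AccentInsensitive;  -- 'equal'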

In a Windows collation, comparison and sorting of non-Unicode data is implemented using the same algorithm as Unicode data. This provides consistency across data types within SQL Server, and it also provides developers with the ability to sort strings in their applications using the same rules that are used by SQL Server—that is, by calling the CompareStringW function of the Microsoft Win32 API.

Binary Collations

Binary collations sort data based on the sequence of coded values defined by the locale and data type. A binary collation in SQL Server defines the language locale and the ANSI code page to be used, enforcing a binary sort order. Binary collations are useful in achieving improved application performance due to their relative simplicity. For non-Unicode data types, data comparisons are based on the code points defined in the ANSI code page. For Unicode data types, data comparisons are based on the Unicode code points. For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin1_General_BIN and Japanese_BIN yield identical sorting results when used on Unicode data. That is because a binary collation compares Unicode data by code value alone, and compares non-Unicode data as raw bytes.

In SQL Server 2000, binary collations performed an incomplete code-point–to–code-point comparison for Unicode data. The first character was compared as WCHAR, followed by a byte-by-byte comparison. For backward compatibility, existing binary collation semantics will not be changed.

Binary collations in SQL Server 2005 include both the previous BIN collation and a new set of pure code-point comparison collations. Customers can choose to migrate to the new binary collations to take advantage of true code-point comparisons, and they should utilize the new binary collations when developing new applications. The new BIN2 suffix identifies collation names that implement the new code-point collation semantics. In addition, a new comparison flag is added corresponding to BIN2 for the new binary sort.
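
The difference between the two binary semantics shows up from the second character onward, because BIN falls back to byte-by-byte (little endian) comparison there. A sketch:

DECLARE @a nvarchar(2), @b nvarchar(2);
SET @a = N'a' + NCHAR(256);   -- second character U+0100 (bytes 00 01, little endian)
SET @b = N'a' + NCHAR(255);   -- second character U+00FF (bytes FF 00, little endian)

SELECT CASE WHEN @a COLLATE Latin1_General_BIN > @b COLLATE Latin1_General_BIN
            THEN 'after' ELSE 'before' END AS BinOrder,    -- 'before': byte comparison sees 0x00 < 0xFF
       CASE WHEN @a COLLATE Latin1_General_BIN2 > @b COLLATE Latin1_General_BIN2
            THEN 'after' ELSE 'before' END AS Bin2Order;   -- 'after': code point 0x0100 > 0x00FF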

SQL Server automatically recommends a default collation based on the system locale. You should change the settings for a Windows default collation only if your installation of SQL Server must match the collation settings used by another instance of SQL Server, or if the collations setting must match the Windows system locale of another computer.

If you need to handle supplementary characters, change the default collation to one of the newer collations that support ordering and comparison operations on supplementary characters. These comparisons are based on code points only, not on linguistically meaningful criteria. Only the 90 collation versions, which are signified by the 90 suffix added to their names, support these operations. For example, instead of the Japanese collation, use Japanese_90. If you do not use a supplementary-aware collation, be careful when you use supplementary characters in operations such as ORDER BY, GROUP BY, and DISTINCT, especially when supplementary and non-supplementary characters are included in the same operation.

SQL Server Collations

SQL Server collations provide sort-order compatibility with earlier versions of SQL Server. (For a full list of SQL Server collations, see SQL Collation Name in SQL Server 2005 Books Online.) SQL Server collations are based on legacy SQL Server sort orders for non-Unicode data—for example, char and varchar data types—defined by SQL Server. The dictionary sorting rules for non-Unicode data are not compatible with any sorting routine provided by Windows operating systems, but the sorting of Unicode data is compatible with a particular version of Windows sorting rules. Because SQL Server collations use different comparison rules for non-Unicode data and for Unicode data, you might see different results for comparisons of the same data, depending on the underlying data type.

When you upgrade an instance of SQL Server, SQL Server collations can be specified for compatibility with existing instances of SQL Server. Because the default collation for an instance of SQL Server is defined during setup, it is important to specify collation settings carefully when:

  • Your application code depends in some way on the behavior of previous SQL Server collations.
  • You are going to use SQL Server 2005 replication with existing installations of SQL Server 6.5 or SQL Server 7.0.
  • You must store character data that reflects multiple languages.

If you have a mix of Unicode and non-Unicode columns in your database, you should primarily use Windows collations. Windows collations apply Unicode-based sorting rules to both Unicode and non-Unicode data. This means that SQL Server internally converts non-Unicode data to Unicode to perform comparison operations. This provides consistency across data types in SQL Server, and also provides developers with the ability to sort strings in their applications that use the same rules that SQL Server uses.

SQL collations, on the other hand, apply non-Unicode sorting rules to non-Unicode data, and Unicode sorting rules to Unicode data, by using a corresponding Windows collation for the Unicode data. This difference can cause inconsistent results for comparisons of the same characters. Therefore, if you have a mix of Unicode and non-Unicode columns in your database, they should all be defined by using Windows collations so that the same sorting rules are used across Unicode and non-Unicode data.
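
The classic demonstration of this difference involves the hyphen, which legacy SQL sort orders treat as an ordinary character but Windows word-sort rules give minimal weight. A sketch, assuming the database default collation is SQL_Latin1_General_CP1_CI_AS:

DECLARE @a varchar(10), @b varchar(10);
SET @a = 'co-op';
SET @b = 'coop';

SELECT CASE WHEN @a < @b THEN 'co-op first' ELSE 'coop first' END AS NonUnicodeOrder,
       CASE WHEN CAST(@a AS nvarchar(10)) < CAST(@b AS nvarchar(10))
            THEN 'co-op first' ELSE 'coop first' END AS UnicodeOrder;
-- The two columns can disagree: the varchar comparison uses the legacy SQL sort
-- order, while the nvarchar comparison uses Windows Unicode rules.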

Backward Compatibility in Collations

The BIN binary collation provided in SQL Server 2000 performed an incomplete code-point to code-point comparison for Unicode data. These binary collations compared the first character as WCHAR, followed by a byte-by-byte comparison. This could cause Unicode character data to be sorted in an unexpected manner.

To ensure backward compatibility, existing binary collation semantics will not be changed. However, the functionality may have been superseded by newer collations. The Appendix lists collations that have been preserved for backward compatibility with SQL Server 2000 or SQL Server 7.0.

To obtain information about collations, you can use the following system views and functions:

sys.databases

Returns information about the collation of a database.

sys.columns

Returns information about the collation of a column of a table or view.

COLLATIONPROPERTY

Returns properties of a collation in SQL Server 2005. You can pass CodePage (returns the code page of the collation), LCID (returns the Windows locale ID, or Null for SQL collations), or ComparisonStyle (returns the Windows comparison style; Null for both binary and SQL collations). ComparisonStyle can be used to verify that there is an equivalence between string normalization in Windows and in SQL Server for a Windows collation.

fn_helpcollations

Returns a list of available collations in SQL Server 2005:

SELECT * FROM ::fn_helpcollations()

In SQL Server 2005, this query returns 1,011 collations. (In SQL Server 2000, it returned 753.)

SERVERPROPERTY

Returns the collation associated with the server:

SELECT CONVERT(char, SERVERPROPERTY('collation'))

DATABASEPROPERTYEX

Determines the collation of a database, for example:

SELECT CONVERT(char, DATABASEPROPERTYEX('pubs', 'collation'))
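
For example, the following quick sketch queries two documented COLLATIONPROPERTY properties for a Windows collation:

SELECT COLLATIONPROPERTY('Latin1_General_CI_AS', 'CodePage') AS CodePage, -- 1252
       COLLATIONPROPERTY('Latin1_General_CI_AS', 'LCID') AS LCID          -- 1033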

You cannot add your own collations; additional collations can be introduced only in service packs or future versions of SQL Server.

Collations and Sorting of Data

As a general rule, every collation defined in SQL Server on a Unicode column will sort all defined Unicode characters. However, there are many different collations because there are many differences in how the data can be sorted. A good example of this is the Georgian modern sort. Although the traditional sorting of Georgian text places all of the characters in a specific order, it is common for modern usage to place certain rarely used characters at the end. These characters are HE, HEI, WE, and HAR. Thus, there are two ways to sort the Georgian alphabet: traditional, and modern.


Figure 4. Traditional Georgian sort order


Figure 5. Modern Georgian sort order

The existence of this collation does not prevent other Unicode data from being sorted according to the sorting order provided in the Latin1_General collation. All collations, in fact, sort Georgian in the traditional form, with the sole exception of the Georgian_Modern_Sort collations. In other words, the same general rules apply to all collations; only the exceptions change between collations.

Collation Levels

SQL Server 2005 supports setting collations at the following levels of a SQL Server 2005 instance:

  • Server level
  • Database level
  • Column level
  • Expression level

This section explains these collation levels and how they interact when multiple collations are used.

Collations Specified at the Server Level

The default collation of a SQL Server instance is set during setup. The default collation of the instance also becomes the default collation of the system databases. After a collation has been assigned to any object other than a column or a database, you cannot change the collation except by dropping and re-creating the object. Instead of changing the default collation of a SQL Server instance, you can specify the collation when you create a new database or database column.

In earlier versions, the collation was always set at the server level, and in many cases it was the only collation you ever needed to set. The server collation acts as the default whenever a new database is created on the server, unless you explicitly set the collation at the database level. Because each database has its own collation, the server-level collation is referenced only when the database is first created.

In SQL Server 2000, you could change the default collation for the server without rerunning Setup by using the Rebuild Master utility (RebuildM.exe), which is located in the Program Files\Microsoft SQL Server\80\Tools\BINN directory. For more information, see How to Rebuild the Master Database (Rebuild Master Utility) in SQL Server 2000 Books Online. In SQL Server 2005, this utility is not supported; instead, you use the REBUILDDATABASE option in Setup.exe. However, because the server collation is not often used, instead of changing the default collation of an instance of SQL Server 2005, you can specify a default collation for each new database you create.

If you must change the default collation for an instance of SQL Server 2005, you must first script or back up your database, drop all user databases, and then rebuild the master database—specifying the new collation in the SQLCOLLATION property of the setup command, as follows:

start /wait setup.exe /qb INSTANCENAME=MSSQLSERVER REINSTALL=SQL_Engine REBUILDDATABASE=1 SAPWD=test SQLCOLLATION=SQL_Latin1_General_CP1_CI_AI

Collations at the Database Level

When a database is created, the COLLATE clause of the CREATE DATABASE statement can be used to specify the default collation of the database. If no collation is specified during database creation, the database is assigned the default collation of the model database. The default collation for the model database is the same as the default collation of the SQL Server instance.

Every database can have a unique collation, with the sort order being set at the database level. Figure 6 shows how the collation is set using SQL Server Management Studio.


Figure 6. Setting the database collation using the Options tab of the Database Properties window

You can also set the collation order by using Transact-SQL. For example, to create a new database that uses the Czech sort order and is case- and accent-sensitive, use a statement like the following:

USE master
GO
CREATE DATABASE Products
ON 
( NAME = products_dat,
   FILENAME = 'c:\program files\microsoft sql server\mssql\data\products.mdf' )
COLLATE Czech_CS_AS
GO

In SQL Server 2005, you can also change the collation of an existing database by using SQL Server Management Studio. In SQL Server 2000, this functionality was not available in SQL Server Enterprise Manager. Instead, you had to use the ALTER DATABASE statement. For example, the following statement changes the collation of the Products database from Czech_CS_AS to Czech_CI_AI (case- and accent-sensitive to case- and accent-insensitive):

ALTER DATABASE Products
COLLATE Czech_CI_AI

Considerations Before Changing the Collation of a Database

If you have data in a text, varchar, or char field and there is no explicit collation on the column, changing the collation of the database alters the way that the data's encoding is interpreted, resulting in a form of corruption of any characters beyond the ASCII range.

Therefore, avoid changing the collation of any database that contains non-Unicode text data, unless the data is stored in columns that have their own explicit collations set.

To change the collation of a database, all of the following must be true:

  • No one else can be using the database.

  • No schema-bound object can be dependent on the database collation. Schema-bound objects include any of the following:

    • User-defined functions and views created with SCHEMABINDING
    • Computed columns
    • CHECK constraints
    • Table-valued functions that return tables with character columns with collations inherited from the default database collation
  • Changing the collation of the database does not create duplicates among any of the system names. If duplicates are found, an error is raised and the action to change the collation fails.

    Objects that can potentially cause such duplications include:

    • Object names (such as procedure, table, trigger, or view).
    • Schema names (such as group, role, or user).
    • Scalar-type names (such as system and user-defined types).
    • Full-text catalog names.
    • Column or parameter names within an object.
    • Index names within a table.

For example, suppose your database contains two tables named Table1 and TABLE1, and you try to change the collation from French_CS_AS (case-sensitive, accent-sensitive) to French_CI_AS (case-insensitive, accent-sensitive). With the first collation, it is possible to have two tables. However, changing to the second collation causes duplicates.
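
The following minimal sketch reproduces this failure, using a hypothetical database name:

USE master;
GO
CREATE DATABASE CollationDemo COLLATE French_CS_AS;
GO
USE CollationDemo;
GO
CREATE TABLE Table1 (id int);
CREATE TABLE TABLE1 (id int);  -- Legal: the names differ under a case-sensitive collation.
GO
USE master;
GO
-- Fails: under the case-insensitive collation, the two table names become duplicates.
ALTER DATABASE CollationDemo COLLATE French_CI_AS;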

Collations Specified at the Column Level

In SQL Server 2005, you can specify the collation of text in a particular column. This can be very useful, for example, if you need to force case sensitivity for a password column. Other scenarios where column-level collations might be useful involve multiple languages within the same table. For example, the customer name column might need to use Unicode with the Latin1_General collation to enable sorting that is appropriate for a variety of names, whereas the product line column might always contain Greek, and for this column a Greek collation might make sense. Figure 7 shows how you can choose a collation and set the sort-order options during table design.


Figure 7. Specifying a collation

If you have not previously set a collation on the column, when you click the column, the dialog box lists <database default> for the Collation property of the column. To change the collation, click the ellipsis (...) button. This opens the Collation dialog box, where you select either a Windows collation or a SQL Server collation, and set sorting options.

When you select a Windows collation, you can specify whether the collation is sensitive to case, accent, kana type, and width.

You can also set column-level collations using Transact-SQL by adding a COLLATE clause to the column definition in the CREATE TABLE statement.

The following example shows how to use Transact-SQL to specify a collation for the job-description column that is set to Arabic, is case- and accent-insensitive, and is kana-type sensitive.

CREATE TABLE jobs
(
   job_id  smallint
      IDENTITY(1,1)
      PRIMARY KEY CLUSTERED,
   job_desc varchar(50)
      COLLATE Arabic_CI_AI_KS
      NOT NULL
      DEFAULT 'New Position - title not formalized yet'
)

You can also use the ALTER TABLE statement to change the collation at the column level (except for an ntext or text column). If the COLLATE clause is not specified, changing a column's data type causes the collation of the column to change to the default collation of the database.
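
For example, the following is a minimal sketch of such a change, assuming a hypothetical table MyTable with a Comments column that has no constraints, indexes, or other dependencies:

ALTER TABLE dbo.MyTable
ALTER COLUMN Comments varchar(200) COLLATE French_CI_AS NULL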

In SQL Server 2005, you can also change the collation programmatically by using the Collation property of the Column object in SQL Server Management Objects (SMO).

The COLLATE clause can be used to change the collations of columns of only the char, varchar, nchar, and nvarchar data types. To change the collation of a user-defined alias data-type column, you must execute separate ALTER TABLE statements to change the column to a SQL Server system data type and change its collation, and then change the column back to an alias data type.
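
The following sketch shows the two steps just described, with hypothetical table, column, and alias-type names:

-- Step 1: change to the base system data type and set the new collation.
ALTER TABLE dbo.Orders
ALTER COLUMN RegionCode varchar(10) COLLATE Greek_CI_AS

-- Step 2: change the column back to the alias data type.
ALTER TABLE dbo.Orders
ALTER COLUMN RegionCode dbo.RegionCodeType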

You cannot change the collation of a column by using ALTER COLUMN if one or more of the following conditions exist:

  • If a CHECK constraint, FOREIGN KEY constraint, or computed column references the column that is being changed.
  • If any indexes, statistics, or full-text indexes are created on the column. Statistics that were created automatically on the column are dropped if the column collation is changed.
  • If a schema-bound view or function references the column.

You can insert or update values in a text column whose collation is different from the code page of the default collation of the database. SQL Server implicitly converts the values to the collation of the column.

Collations Specified in Expressions

Expression-level collations are set at the time a statement is run and they affect the way a result set is returned. This allows sorting results so that the ORDER BY clause can be language-specific.

During SQL Server setup, you are prompted to select either Windows collations or SQL collations. Your choice of collations affects the data comparison and sort-order behaviors of your instance of Microsoft SQL Server.

There may be many instances when you need to display data to people in different countries and want locale-appropriate sorting. In both SQL Server 2000 and 2005, you can specify collations in expressions. This powerful feature allows you to sort in a particular manner so that the ORDER BY clause can be language-specific.

For example, the following query on the AdventureWorks database sorts the Person.Address table by the City column. The Lithuanian sort order provides a good example of collation differences because the rules about how the letter Y sorts are so striking.

SELECT *
FROM Person.Address
ORDER BY City COLLATE Lithuanian_CI_AI

The following example assumes that Table1 has no explicit column-level collations. In this case, both columns are compared by using the Turkish sort order.

SELECT   *
FROM  Table1
WHERE Field1 = Field2 COLLATE Turkish_ci_ai

For an explanation of how collations are used in comparisons, see the "Rules of Precedence for Collations" section, later in this paper.

COLLATE Keyword

A collation can be defined at the database level, the column level, or in an expression. To specify the collation for a database, column, or character expression, use the following syntax:

COLLATE { <Windows_collation_name> | <SQL_collation_name> }

If the column is not defined by using a Unicode data type (nchar, nvarchar, nvarchar(max), or ntext), the collation is mapped to a corresponding code page.

The COLLATE keyword provides the option to use the following two types of collations:

Windows collations

These are defined by Windows. You can change options to specify case, accent, kana, and width sensitivity, and you can choose a binary sort order.

SQL collations

These collations are provided for backward compatibility. You cannot configure the sort order.

In general, try to use the Windows collations whenever you can. The following example presents a list of country codes and names, showing how sorting behavior can change depending on the collation. The upper part of the Query Window shows the list sorted by the default collations; the lower half of the Query Window shows the same data sorted by the Lithuanian collation.

In the default collation, shown first, Y comes between X and Z. In the Lithuanian collation, shown second, Y comes between I and J.


Figure 8. Effect of Lithuanian collation on sorting

Rules of Precedence for Collations

Because you can specify collations at the level of the server, the database, columns, and in expressions, it is important to understand how the collations interact. Collation precedence determines how expressions that evaluate to a character string are collated, and determines the collation that is used by operators that use character string inputs but do not return a character string, such as LIKE and IN.

The collation precedence rules in SQL Server 2005 apply only to the character string data types: char, varchar, text, nchar, nvarchar, and ntext. Objects that have other data types do not participate in collation evaluations.

The comparison operators and the MAX, MIN, BETWEEN, LIKE, and IN operators are collation-sensitive. The string used by the operators is assigned the collation label of the operand that has the higher precedence. The UNION operator is also collation-sensitive; all string operands and the final result are assigned the collation of the operand with the highest precedence. The collation precedence of the UNION operands and result are evaluated column by column.

The assignment operator is collation-insensitive; the expression on the right side is cast to the collation of the left side.

The string concatenation operator is collation-insensitive; the two string operands and the result are assigned the collation label of the operand that has the highest collation precedence. The UNION ALL and CASE operators are also collation-insensitive, and all string operands and the final result are assigned the collation label of the operand with the highest precedence. The collation precedence of the UNION ALL operands and result are evaluated column by column.

Because objects can have different collations at various levels, SQL Server 2005 introduces collation labels to help you manage the complex interaction of collations. A collation label names a category of objects that can take a collation. The collation rules are described in terms of interactions between collation labels.

If an expression references only one character string object, the collation label of the referenced object is used. If the expression references two operand expressions that have the same collation label, that label becomes the collation label of the expression.

If a complex expression references two operand expressions with different collations, the collation label of the final result uses a series of rules to determine the precedence of the collations. For more information, see Collation Precedence (Transact-SQL) in SQL Server 2005 Books Online.

The following list describes the different types of collation labels. The list is followed by a chart that summarizes possible interactions of the collation labels.

Explicit

A collation is explicitly defined for a given expression, or explicitly cast to a specific collation (X) by using a COLLATE clause in the expression.

Implicit

A column is referenced. Even if the column were explicitly assigned a collation by using a COLLATE clause in the CREATE TABLE or CREATE VIEW statement, a column reference is classified as implicit.

Coercible-Default

The database-level collation is used for any Transact-SQL character string variable, parameter, literal, the output of a catalog built-in function, or the output of a built-in function that does not take string inputs but produces a string output.

If the object is declared in a user-defined function, stored procedure, or trigger, the object is assigned the default collation of the database in which the function, stored procedure, or trigger is created. If the object is declared in a batch, the object is assigned the default collation of the current database for the connection.

No Collation

When the value of an expression is the result of an operation between two strings that have conflicting implicit collations, the expression result is defined as not having a collation.

                 Explicit C1       Implicit C1      Default       No Collation
  Explicit C2    Run-time error    Explicit C2      Explicit C2   Explicit C2
  Implicit C2    Explicit C1       No Collation     Implicit C2   No Collation
  Default        Explicit C1       Implicit C1      Default       No Collation
  No Collation   Explicit C1       No Collation     No Collation  No Collation

This table shows that the only time SQL Server cannot handle an expression is when you explicitly define two different, conflicting collations, or when you try to compare two items and no common ground for the comparison can be found.

For example, consider the following Transact-SQL statement for creating a table:

CREATE TABLE TestTab (
   id int,
   GreekCol nvarchar(10) COLLATE greek_ci_as,
   LatinCol nvarchar(10) COLLATE latin1_general_cs_as
   )
INSERT TestTab VALUES (1, N'A', N'a')
GO

This statement creates a table with one column that uses a case-insensitive, accent-sensitive Greek collation, and a second column that uses a case-sensitive, accent-sensitive General Latin1 collation.

You could attempt to use a query to compare the two explicitly:

SELECT *
FROM TestTab
WHERE GreekCol = LatinCol

However, the query returns an error:

Msg 468, Level 16, State 9, Line 1
Cannot resolve the collation conflict between "Latin1_General_CS_AS" 
and "Greek_CI_AS" in the equal to operation.

The error occurs because the server cannot compare two segments of text that have different collations. If, however, you use the COLLATE keyword to explicitly create an expression that allows them to be compatible, as shown in the following example, the query works:

SELECT *
FROM TestTab
WHERE GreekCol = LatinCol COLLATE greek_ci_as

Although LatinCol usually has a case-sensitive collation, the case-insensitive collation of the expression overrides case sensitivity and allows the uppercase and lowercase 'A' to be treated as equal.

Whether a function is collation-sensitive depends on the data type of the inputs. The CAST, CONVERT, and COLLATE functions are collation-sensitive for char, varchar, and text data types. If the input and output of the CAST and CONVERT functions are character strings, the output string has the collation label of the input string. If the input is not a character string, the output string is Coercible-default. In that case, it is assigned the collation of the current database for the connection, or the database that contains the user-defined function, stored procedure, or trigger in which the CAST or CONVERT is referenced.

For built-in functions that return a string but do not take a string input, the result string is Coercible-default and is assigned either the collation of the current database, or the collation of the database that contains the user-defined function, stored procedure, or trigger in which the function is referenced.

Limitations of the COLLATE Keyword

The COLLATE keyword and its related features are powerful and provide great flexibility to the international database developer; however, there are some limitations. These limitations, and their respective workarounds, are listed in this section.

Returning Less than a Full List of Collations

The fn_helpcollations function returns a list of all collations provided by SQL Server, with the exception of three deprecated collations that do not show up in ::fn_helpcollations(): Hindi, Korean_Wansung_Unicode, and Lithuanian_Classic. The Hindi collation is deprecated in SQL Server 2005 because the Windows 2000 Beta 2 sorting table was used in SQL Server 2000. These collations still exist in the server, but they will not be supported in a future SQL Server release.

If you want to work with these collations programmatically, you must provide a user interface for this functionality yourself.

Implicit Conversion

Implicit conversion of non-Unicode character data between collations is non-deterministic. Implicit conversion of character strings to datetime or smalldatetime is also considered non-deterministic, unless the compatibility level is set to 80 or earlier.

In earlier versions of SQL Server, system object and system type names were matched against the collation of the master database. In SQL Server 2005, system object names and system type names are automatically cast to correspond to the collation of the current database. If references to these objects in your script or applications do not match how they appear in the catalog and the collation of the current database causes a mismatch, the script or application may fail. For example, the statement EXEC SP_heLP fails if the current database has a case-sensitive collation.

Issues with Defining Collation at the Column Level

Sometimes you might have a database that requires one sort order (for example, Latin1_General) and individual columns that require a different sort order (for example, Greek). It may also happen that the database contains multilingual data that needs to be sorted according to more than one collation, so you cannot define a single collation. In either case, the ability to define multiple collations, each of which can be indexed, means that you can access the data for each target language by specifying the correct collation. For example, you can more easily view Greek data by specifying the Greek collation.

Moreover, by specifying the additional collation, you can use indexed search in queries on this data. The ability to index the column for search is a critical requirement. In the example in the previous paragraph, using a COLLATE expression in the ORDER BY clause of a query provides the functionality needed to sort the data; however, the ordering will not be indexed, so query results will be returned more slowly in large datasets. For this reason, column-level collation makes sense only if you do not have multilingual data in a column, or if you denormalize your database to store different languages in different columns.
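
For example, a query like the following (with hypothetical table and column names) sorts appropriately for Greek readers, but the COLLATE expression in the ORDER BY clause cannot be satisfied by an index on the column:

SELECT ProductLine
FROM dbo.Products
ORDER BY ProductLine COLLATE Greek_CI_AS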

Collations and tempdb

The tempdb database is built every time that SQL Server is started. tempdb always has the same default collation as the model database. This collation is typically the same as the default collation of the instance. If you create a user database and specify a different default collation than model, the user database has a different default collation than tempdb. All temporary stored procedures or temporary tables are created and stored in tempdb. This means that all implicit columns in temporary tables and all Coercible-default constants, variables, and parameters in temporary stored procedures have collations that are different from comparable objects created in permanent tables and stored procedures.

This could lead to problems with a mismatch in collations between user-defined databases and system database objects. For example, assume that an instance of SQL Server 2005 uses the SQL_Latin1_General_CP1_CI_AS collation and you execute the following statements:

CREATE DATABASE TestDB COLLATE Estonian_CS_AS;
USE TestDB;
CREATE TABLE TestPermTable (PrimaryKey int PRIMARY KEY, Col1 nchar );
INSERT TestPermTable VALUES (1, 'a')

Next, create a temporary table with the same column declarations as TestPermTable, using these statements:

CREATE TABLE #TestTempTable (PrimaryKey int PRIMARY KEY, Col1 nchar )
INSERT #TestTempTable 
SELECT * FROM TestPermTable
GO

In this system, the tempdb database uses the SQL_Latin1_General_CP1_CI_AS collation (code page 1252), whereas TestDB and TestPermTable.Col1 use the Estonian_CS_AS collation (code page 1257).

Next, you query for the rows that should be the same, using these statements:

SELECT * FROM TestPermTable t1
JOIN #TestTempTable t2
ON t1.Col1 = t2.Col1

Because tempdb uses the default server collation and TestPermTable.Col1 uses a different collation, SQL Server returns this error:

"Msg 468, Level 16, State 9, Line 1
Cannot resolve the collation conflict between 
"SQL_Latin1_General_CP1_CI_AS" and "Estonian_CS_AS" in the equal to 
operation."

To prevent the error, you can use one of the following alternatives:

  • Specify that the temporary table column use the default collation of the user database, not tempdb. This enables the temporary table to work with similarly formatted tables in multiple databases, if that is required of your system.
  • Explicitly specify the correct collation for the #TestTempTable column, by using the COLLATE keyword in the column definition (see the sketch that follows this list).
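
The following sketch shows both alternatives for the temporary table from the preceding example:

-- Alternative 1: inherit the collation of the current user database.
CREATE TABLE #TestTempTable
   (PrimaryKey int PRIMARY KEY,
    Col1 nchar COLLATE database_default)

-- Alternative 2: name the collation explicitly.
-- Col1 nchar COLLATE Estonian_CS_AS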

LCIDs and Collations

Windows uses the locale ID (LCID) to define sort orders. If you do not know the LCID, you can use the default LCID by specifying 1024, or you can format your data by using Microsoft .NET formatting functions. In a Web-based ASP application, you can use the SetLocale function in Microsoft Visual Basic Scripting Edition (VBScript) to change the formatting to use the date/time, number, and currency formatting preferences of any locale. However, because there is no direct mapping between LCIDs and collations, you cannot determine the correct collation from an LCID.

This is inconvenient if you would like to automatically assign the best sorting order for users, based on their locale ID. For example, if you have a multilingual Web site, with people visiting from different countries and viewing product information, you could map the HTTP_ACCEPT_LANGUAGE variables provided by their browsers to LCIDs for formatting date and currency values by using the Session.LCID property. For the sake of usability, you might also want to implement sorting using their locale.

To help build your own mapping function to work around this issue, you can use the list of collation designators provided in Setup. A collation designator provides a mapping between a specific collation and a standard collation. For a list of system locales and the recommended default collations, see Collation Settings in Setup in SQL Server 2005 Books Online.

The following are some examples of standard collations used for mapping:

  • Latin1_General can be used for the U.S. English character set (code page 1252).
  • Modern_Spanish can be used for all variations of Spanish that use the same character set as U.S. English (code page 1252).
  • Arabic can be used for all variations of Arabic that use the Arabic character set (code page 1256).
  • Japanese_Unicode can be used for the Unicode version of Japanese, which has a different sort order from Japanese but the same code page (932).

You can also specify the sort order to use with the collation designator you have selected. Binary is the fastest sorting order and is case-sensitive. If a binary sort order is selected, you cannot use the case-sensitive, accent-sensitive, kana-sensitive, or width-sensitive options.

ISO Strings and Collations

In VBScript, you can obtain the HTTP_ACCEPT_LANGUAGE variable with a script such as the following:

Dim stLang
stLang = Request.ServerVariables("HTTP_ACCEPT_LANGUAGE")

Because many Web developers use this value for locale information, the SetLocale function in VBScript was designed to accept this value directly, in addition to accepting LCID values. This means that you do not have to go through the intermediate step of mapping the value to an LCID. For each string that represents a locale, such as "en-us" (English-United States) or "vi" (Vietnamese), you must map to the appropriate collation, such as Latin1_General, or a Vietnamese collation.
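
One way to build such a mapping is to maintain a lookup table of your own. The following is a minimal sketch (the table, its contents, and the query target are hypothetical); dynamic SQL is used because the COLLATE clause requires a literal collation name:

CREATE TABLE dbo.LocaleCollationMap
   (AcceptLanguage varchar(10) PRIMARY KEY,  -- Value from HTTP_ACCEPT_LANGUAGE.
    CollationName  sysname NOT NULL)

INSERT dbo.LocaleCollationMap VALUES ('en-us', 'Latin1_General_CI_AS')
INSERT dbo.LocaleCollationMap VALUES ('lt', 'Lithuanian_CI_AI')

DECLARE @coll sysname, @sql nvarchar(500)
SELECT @coll = CollationName
FROM dbo.LocaleCollationMap
WHERE AcceptLanguage = 'lt'

SET @sql = N'SELECT City FROM Person.Address ORDER BY City COLLATE ' + @coll
EXEC sp_executesql @sql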

Custom Collations

Developers frequently want to define their own collations. However, because all of the new SQL Server collations are derived from Windows collations, it is not possible to create new collations.

Moreover, a collation must be designed to define the method for sorting every defined character in the Unicode standard, and there is no user interface to create new collations.

Collation and Supplementary Characters

Supplementary characters can be used in ordering and comparison operations only if you use one of the _90 collations. These comparisons are based only on code points, and not on linguistically meaningful criteria. Be careful when you use supplementary characters in operations such as ORDER BY, GROUP BY, and DISTINCT, especially when supplementary and non-supplementary characters are included in the same operation.

Because supplementary characters are stored as surrogate pairs of two double-byte code units, the LEN() function returns the value 2 for each supplementary character that is contained in the argument string. Similarly, the CHARINDEX and PATINDEX functions misrepresent the occurrence of supplementary characters inside character strings, and the NCHAR function returns a character that represents only one half of the supplementary character pair. Converting a binary or varbinary value to a supplementary character also produces only one half of a supplementary character pair.

The LEFT, RIGHT, SUBSTRING, STUFF, and REVERSE functions may split any supplementary character pairs and lead to unexpected results.
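
For example, the following minimal sketch builds one supplementary character (U+1D11E, MUSICAL SYMBOL G CLEF) from its surrogate pair and shows the LEN() behavior:

DECLARE @s nvarchar(10)
SET @s = NCHAR(0xD834) + NCHAR(0xDD1E)  -- Surrogate pair for U+1D11E.
SELECT LEN(@s) AS LenResult             -- Returns 2, although @s holds one supplementary character.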

In SQL Server 2005, it is possible to bypass the limitations of the SQL Server system functions when working with supplementary characters by using CLR user-defined functions. For more information about how to create and use supplementary character-aware functions using managed code in the .NET Framework common language runtime, see the samples that are provided with SQL Server 2005. One sample, the StringManipulate sample, demonstrates supplementary character-aware string processing, including implementations of the LEN(), LEFT(), RIGHT(), SUBSTRING(), and REPLACE() string functions with the additional capability to handle both Unicode and supplementary character strings. The sample includes scripts in either C# or Visual Basic .NET, and also a Transact-SQL script for loading the assembly and creating the functions in SQL Server. For more information about the samples that are available, see Creating CLR Functions in SQL Server 2005 Books Online.

Server-Client Communication (Data-Access Technologies)

In database applications, it is not enough to use only SQL Server tools, such as SQL Server Management Studio. Typically, the server interacts with other servers or clients, using one or more data-access standards. In this context, the application or layer that connects to SQL Server is called the client. For the purposes of this discussion, there are two types of clients:

  • Unicode clients—SQL Native Client, OLE DB, and ODBC versions 3.7 and later (MDAC 2.8 and later)
  • Non-Unicode clients—ODBC version 3.6 and earlier, and DB-Library (MDAC 2.xx)

This section presents the data-access technologies that are available, and briefly discusses the issues that you may encounter when working with multilingual data in a client-server environment.

DB-Library

Although the SQL Server 2005 Database Engine still supports connections from existing applications using the DB-Library and Embedded SQL APIs, it does not include the files or documentation needed to do programming work on applications that use these APIs. A future version of the SQL Server Database Engine will drop support for connections from DB-Library or Embedded SQL applications. Moreover, there is no Unicode version of DB-Library.

Therefore, do not use DB-Library to develop new applications. Also, remove any dependencies on DB-Library when modifying existing applications. Instead of these APIs, use the System.Data.SqlClient namespace or an API such as OLE DB or ODBC.

SQL Server 2005 does not include the DB-Library DLL required to run these applications. To run DB-Library or Embedded SQL applications you must have available the DB-Library DLL from SQL Server version 6.5, SQL Server 7.0, or SQL Server 2000.

SQL-DMO

SQL Distributed Management Objects (SQL-DMO) is a COM layer that encapsulates SQL Server database and replication management. Because it is COM, the same rules that applied to ADO and OLE DB apply to SQL-DMO. SQL-DMO also has properties that can be used for features mentioned earlier, such as the Collation property on the SQLServer2, Database2, Column2, SystemDateType2, and UserDefinedDataType2 objects.

OLE DB

OLE DB is the central component of the Microsoft Data-Access Components (MDAC). OLE DB is based on COM, and thus all of the strings are Unicode BSTRs (UTF-16 in Windows XP, Windows Server 2003, and Windows 2000; UCS-2 in all other operating systems). However, versions of MDAC earlier than MDAC 2.8 SP1 do not support named instances. Applications may not be able to connect to named instances of SQL Server 2005. To take advantage of the features of SQL Server 2005, upgrade to MDAC 2.8 SP1.

SQL Server version   Data-access library
SQL Server 7.0       MDAC version 2.1
SQL Server 2000      MDAC version 2.6
SQL Server 2005      MDAC version 2.8

For SQL Server, the provider is the Microsoft OLE DB Provider for SQL Server (SQLOLEDB). Data is converted to Unicode as needed, using the collation of the actual data.

ADO

Microsoft ActiveX Data Objects is a Visual Basic- and scripting-friendly interface that acts as a wrapper around OLE DB. It is also a COM component and thus has the same level of support for Unicode. There is no way to decouple ADO and OLE DB in a way that would allow conversions between the two; therefore, when problems exist they are always at the OLE DB layer.

ODBC

Some versions of ODBC support Unicode. One important issue with non-Unicode data is the way that the data is translated between code pages or to and from Unicode when ODBC is used. When you use a version of ODBC earlier than 3.7, data is converted from Unicode to ANSI using the default system code page. Even if you install a later version of ODBC, Jet 3.5 will do the same conversion.

There are two possible settings for the SQL_COPT_SS_TRANSLATE attribute when it is passed to SQLSetConnectAttr:

  • SQL_XL_OFF

    The driver does not translate characters from one code page to another in character data exchanged between the client and the server.

  • SQL_XL_ON

    The driver translates characters from one code page to another in character data exchanged between the client and the server. The driver automatically configures the character translation, determining the code page installed on the server and that in use by the client.

SQL_XL_ON is the default. You can also set this option by using the TranslateChar method of the SQL-DMO SQLServer object. This default (auto_translate on) usually provides the desired behavior whenever you are dealing with non-Unicode data.

If you turn auto-translate off, you must understand the consequences. Developers often disable auto translation on the assumption that this will allow the client and server to communicate without having to match code pages. However, this interaction is not supported. Moreover, to perform sorts and scans of non-Unicode data that has been defined with a Windows collation, SQL Server converts the data to Unicode before performing the sort. Therefore, if auto translation is disabled, the translated Unicode data may not be translated correctly back to its original non-Unicode code page when it is retrieved on the client side.

Microsoft SQL Native Client

Microsoft SQL Native Client was designed to provide a simplified method of gaining native data access to SQL Server using either OLE DB or ODBC. It combines OLE DB and ODBC technologies into one library, and it provides a way to innovate and evolve new data-access features without changing the current MDAC components, which are now part of the Microsoft Windows platform.

If you do not need to use any of the new features of SQL Server 2005, there is no need to use the SQL Native Client OLE DB provider; you can continue using your current data-access provider, which is typically SQLOLEDB. If you are enhancing an existing application and need to use the new features of SQL Server 2005, use the SQL Native Client OLE DB provider.

SQL Native Client uses components in MDAC, but is not explicitly dependent on a particular version of MDAC.

For new applications, if you're using a managed programming language such as Microsoft Visual C# or Visual Basic .NET, and you need to access the new features of SQL Server 2005, use the .NET Framework Data Provider for SQL Server that is part of the .NET Framework for Visual Studio 2005. This gives you the most robust data-access component for working with SQL Server 2005.

If you are developing a COM-based application and need to access the new features of SQL Server 2005, use SQL Native Client. SQL Native Client supports access to SQL Server 7.0 and later versions.

For existing OLE DB and ODBC applications, the primary issue is whether or not you need to access the new features of SQL Server 2005. If you have a mature application that does not need the new features of SQL Server 2005, you can continue to use MDAC. However, return values of data type varchar(max), nvarchar(max), varbinary(max), xml, UDT, or other large object types cannot be returned to client versions earlier than SQL Server 2005. If you wish to use these types as return values, you must use SQL Native Client.

Both MDAC and SQL Native Client provide native data access to SQL Server databases, but SQL Native Client was specifically designed to expose the new features of SQL Server 2005, while maintaining backward compatibility with earlier versions. In addition, although MDAC contains components for using OLE DB, ODBC, and ActiveX Data Objects (ADO), SQL Native Client implements only OLE DB and ODBC.

For more information about the difference between SQL Native Client and MDAC, see Updating an Application to SQL Native Client from MDAC in SQL Server 2005 Books Online.

To take advantage of new features introduced in SQL Server 2005, existing applications that use ActiveX Data Objects (ADO) should use the SQL Native Client OLE DB provider as their data-access provider.

Unicode and Non-Unicode Server and Client Configurations

This section presents some issues to consider when working with legacy systems that are not fully Unicode-compliant.

Unicode Server and Unicode Client

This is the ideal configuration. By keeping data in Unicode throughout the process, you can guarantee the best performance and also protect retrieved data from corruption. This is the case with ADO and OLE DB, or with SQL Server Native Client.

Unicode Server and One or More Non-Unicode Clients

In this type of configuration, you may not have any problems storing data, but there is a serious limitation when it comes to bringing the data to the client and using it: at some point, the Unicode data must be converted through the client code page.

An example of this scenario is connecting to a SQL Server 2005 database from a computer that is using DB-Library. DB-Library is an interface that allows C applications to access SQL Server. DB-Library has not been upgraded significantly since SQL Server 6.5. Therefore, it has the same limitations as SQL Server 6.5: Data can be based on only one code page, the default OEM code page for the system. You can also choose whether locale information will be based on the locale settings of the client system or not. The SQL Server Client Network Utility provides two non-exclusive options for how DB-Library converts information: automatic ANSI to OEM conversion, and use of international settings. Both of these options are selected by default.

Because you cannot handle data on other code pages, the only time DB-Library should be used in the data layer is in legacy systems that deal with a subset of data from SQL Server. This technology is provided for backward compatibility only, but remains an option if you need to support multilingual data in legacy applications.

Other clients that are not Unicode-enabled include earlier versions of Microsoft Office Access. If you connect to a SQL Server database using an older version of Office Access, Unicode may be converted to question marks, as shown in Figure 9.


Figure 9. Japanese table names, shown as question marks

This happens because the default behavior in ODBC 3.7 was to automatically convert to the system code page; however, because the Japanese characters were not included in the code page (1252) of the U.S. English client computer, they are replaced by question marks.

Moreover, it is not possible to connect to these tables; connection attempts will result in an error message. Jet and ODBC try to connect to a table named dbo.????, which will fail because such a table does not exist. This error will occur with any data not in the code page of the client system.

Once data has been converted to the wrong code page and been replaced by question marks, there is no way for it to ever be converted back. For example, if you use a non-Unicode client to connect to a table containing Korean data in Unicode, question marks replace the Korean text characters, as seen in Figure 10.


Figure 10. Example of characters that cannot be viewed in non-Unicode client database

All three of the legacy clients (DB-Library, ODBC, and Jet 3.5) share the problem of being unable to handle Unicode, except for those characters that can be converted to the default system code page. Avoid using these components for client development unless you know that data can be contained in the default system code page. For more information, see Managing Data Conversion Between a Unicode Server and a Non-Unicode Client in SQL Server 2005 Books Online.

Also, because COM provides the entry points to SQL Server data within Office Access, the regional settings of the client are used to interpret the meaning of the data. This can affect the input of date/time values, numbers, and currency values. For more information about this issue, see the "Dealing with COM's Locale Interference" section, later in this paper.

If you use Office Access as a front end to SQL Server, we recommend that you upgrade your Office Access front end to use the new capabilities and Unicode support in Microsoft Office Access 2003 or Microsoft Office Access 2007. Online Help for Office Access provides guidance on how to upgrade existing Office Access applications. Note that the ADP format is deprecated in later versions of Office Access.

For information about how to optimize your Office Access application to improve performance and updatability, see the white paper Optimizing Microsoft Office Access Applications Linked to SQL Server in the MSDN Library.

Non-Unicode Server and Unicode Client

This is a poor configuration for multilingual data, because you will not be able to store the data on the server. Because both SQL Server 2000 and SQL Server 2005 are Unicode-aware, you are unlikely to encounter this configuration unless a linked server is defined to a SQL Server 6.5 database. You can receive valid data from the instance of SQL Server 6.5, but do not insert any data that is not included in that server's default code page.

Conversion of Multilingual Data from Earlier Versions of SQL Server

Some legacy applications, which may have been created before full Unicode support was implemented in SQL Server, use custom encoding schemes to handle multilingual data. To preserve the data, you can use the bulk copy program (bcp utility) to save the data out as binary, to prevent conversion errors. Then use bulk copy to re-import the data with the appropriate code page by specifying the -C command-line parameter. For more information about the bcp utility, see the Using Command-Line Utilities with Multilingual Data section of this paper.
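
For example, a sketch of this round trip (the server, database, table, and file names are placeholders; -n uses native format, -C specifies the code page for character data, and -T uses a trusted connection):

bcp LegacyDb.dbo.SourceTable out legacy.dat -n -S myServer -T
bcp NewDb.dbo.TargetTable in legacy.dat -n -C 1251 -S myServer -T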

Multilingual Data in the User Interface

The SQL Server administration and management tools have been updated to improve multilingual data support. This section describes some of the changes that you will see in SQL Server 2005.

General UI Changes

The new interface for SQL Server Management Studio provides the flexibility to host a variety of text and query editors, management dialog boxes, wizards, designers, and browsers in the same shell. Within this shell, you can specify the session language and control display of the following elements:

  • The language that will be used for error and other system messages
  • The format of date and time data
  • The names of days and months, including abbreviations
  • The first day of the week
  • Currency symbols and formats

To change the font used in SQL Server Management Studio, on the Tools menu, click Options, expand Environment, and then click Fonts and Colors to customize the font style, size, and color display settings for individual tool windows that have output panes. If you change the font settings, the changes do not take effect during the session in which you make them. However, you can evaluate the effects of the change by opening another instance of SQL Server Management Studio.

There are 34 languages available for use as session settings. For a list of languages, see sys.syslanguages (Transact-SQL) in SQL Server 2005 Books Online.

You can also change the text and display options for working with Integration Services packages or Analysis Services multidimensional and data-mining objects by using Business Intelligence Development Studio. Use the Environment pages in the Options dialog box to configure options that include the language used in the environment for the user interface and for online Help, fonts, and the keyboard mapping scheme. After you have configured the development environment to your liking, your configurations are saved to a *.suo file in the solution folder, which you can reuse.

Note   To view supplementary characters, you must install a font that supports the extensions for these characters. For more information, see Surrogates and Supplementary Characters in the MSDN Library.

To view the list of languages that are available for full-text indexing and querying operations, use sys.fulltext_languages. This catalog view contains one row per language. Each row provides an unambiguous representation of the available full-text linguistic resources that are registered with Microsoft SQL Server. The name or LCID can be specified in the full-text queries and full-text index DDL.

  • To set the session language from the server side, use SET LANGUAGE.
  • The session language can also be set on the client side by using OLE DB, ODBC, or ADO.NET.
  • For OLE DB, use the SSPROP_INIT_CURRENTLANGUAGE property.
  • For ODBC, use the LANGUAGE keyword. For more information, see SQLConfigDataSource in the MSDN Library.
  • For ADO.NET, use the Current Language keyword in the connection string (see the example that follows this list).
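
For example, a connection string for the .NET Framework Data Provider for SQL Server can set the session language as follows (the server and database names are placeholders):

Server=myServer;Database=AdventureWorks;Integrated Security=SSPI;Current Language=Italian;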

Support for Complex Scripts

SQL Server 2005 supports inputting, storing, changing, and displaying complex scripts throughout the database engine and tools. Complex scripts include bidirectional scripts, such as Arabic; scripts whose characters change shape depending on their position, such as Arabic and Indic scripts; and languages, such as Thai, that require internal dictionaries to recognize words because there are no breaks between them.

Database applications that interact with SQL Server must use controls that support complex scripts. Standard Windows Forms controls that are created in managed code are complex script-enabled. However, the files required to support complex scripts have to be installed on your computer through the regional and language settings. For more information, see Complex Script Support in SQL Server 2005 Books Online.

Formatting in Query Designer

In the Query Designer, for the most part you can enter information in the Grid pane in a format that matches the default regional settings of the computer, or you can explicitly use the CONVERT function to handle a string in an arbitrary format.

There are a few design limitations to be aware of, when using the regional settings:

  • Long data formats are not supported.
  • Currency symbols should not be entered in the Grid pane, although the U.S. dollar sign ($) can optionally be used. Either way, the currency symbol retrieved from the regional settings will be used in the Results pane.
  • Unary minus always appears on the left side without parentheses, regardless of the regional settings. Thus, -1 should be represented as -1 instead of 1- or (1) or any other valid variation that may be specified in the Regional Options dialog box.

These limitations are necessary to allow a certain amount of worldwide support in the Query Designer, but will not block most efforts to use locale-specific data.

Because any information entered in the Grid pane will be translated to a locale-independent format in the SQL pane, "03.09.65" on a Standard German computer will be translated to { ts '1965-09-03 00:00:00' }. All data entered directly into the SQL pane should be in this format or otherwise include an explicit CONVERT call.
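
For example, the following sketch uses CONVERT with an explicit style number to interpret a German-format date string (style 104 is dd.mm.yyyy):

SELECT CONVERT(datetime, '03.09.1965', 104)  -- September 3, 1965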

Sort Order

The sorting of data displayed in the Results pane is not influenced by the Regional Settings; instead, the collation rules control how an ORDER BY clause is interpreted.

Multilingual Transact-SQL

When you send an SQL statement that contains multilingual data to the server, two issues affect whether the data is conveyed correctly to the server: (1) the encoding of the SQL statement itself, and (2) the encoding of string literals inside the statement.

Encoding SQL Statements

There are restrictions on the characters that can be used in Transact-SQL statements. You can enter DBCS characters for literals and database object names, aliases, parameter names, and parameter marker characters. However, you cannot use DBCS characters for SQL language elements such as function names or SQL keywords. Thus, when you type SQL keywords, such as SELECT, you must use the half-width characters "SELECT" instead of the Japanese full-width characters, "ＳＥＬＥＣＴ".

If the SQL string uses Unicode (for example, any SQL string using ADO), you can encode any type of character. If the string does not use Unicode (for example, a string in a non-Unicode batch file or .sql file), the conversion must be made at some point. When you do this conversion, you must plan carefully and take into consideration the default system code page of the computer on which the conversion is done to prevent problems in multilingual applications.

Encoding String Literals in SQL Statements

Unicode string constants that appear in code executed on the server, such as in stored procedures and triggers, must be preceded by the capital letter N. If a string literal is not in Unicode (marked with the N prefix), the string is converted to the default code page of the database, which may not recognize certain characters. With multilingual data it is best to use a Unicode data type and Unicode string literals.

The following example shows a Unicode string designated by placing an "N" prefix in front of the string:

SELECT N'हिंदी'

Figure 11 shows how this string, which is the Hindi word for the Hindi language, would appear on a computer that had the correct Hindi font installed.


Figure 11. Example of Hindi characters on Hindi client

If the "n" prefix were not placed before the string, it would be converted to ??????. This also happens with data that has a code page that does not match the system defaults.

Warning Remember that the use of the "n" prefix both in string literals and in data types (nchar, nvarchar, and ntext) to represent Unicode data is specific to SQL Server. The ANSI-92 SQL specification defines the National character data types but does not specify them as having to be Unicode. The ANSI-99 SQL specification does discuss using a set of Unicode types with a "u" prefix (for example, utext, uchar, and uvarchar). These data types are not available in SQL Server.

Use the N prefix in front of strings to make sure that they are passed as Unicode. This helps avoid unintended problems with conversions when using the default system code page of the server.

If your application does not send Unicode data to SQL Server and the client's ANSI code page matches the SQL Server code page, there is no need to prefix string constants with N. You will not experience data loss as a result of omitting the prefix.

If you are using a linked server with SQL Server 7.0, there are additional issues to consider. SQL Server 7.0 allows you to select a Unicode collation during installation that is distinct from the sort order; in some cases this causes operations involving strings prefixed with N to produce different results from those that do not have the prefix. For more information about this issue, see the Microsoft Help and Support site.

String-Handling Functions

Transact-SQL has built-in string-handling functions that have important multilingual considerations:

ASCII

Returns the code point of the first character in a string using the current default system code page. If the character is not on that code page, a 63 is returned (the code point for a question mark). This is similar to the Asc() function in Visual Basic and VBScript.

CHAR

Returns a character, given the ANSI code point. This is essentially the inverse operation of the ASCII function; it is similar to the Chr() function in Visual Basic and VBScript. If the code point is not in the 0–255 range, it returns Null.

NCHAR

Returns a character given its Unicode code point. NCHAR is the Unicode equivalent of the CHAR function. It is similar to the ChrW() function in Visual Basic and VBScript.

UNICODE

Returns the Unicode code point of the first character in a string. UNICODE is the Unicode equivalent to the ASCII function. It is similar to the AscW() function in Visual Basic and VBScript.
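
The following quick sketch shows the four functions side by side (the Greek capital letter omega, code point 937, is chosen arbitrarily for illustration):

SELECT ASCII('A') AS AsciiCode,     -- 65
       CHAR(65) AS AnsiChar,        -- A
       UNICODE(N'Ω') AS CodePoint,  -- 937
       NCHAR(937) AS UnicodeChar    -- Ω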

An example of the NCHAR function can be found earlier in this paper (see the "Date/Time Types: datetime, smalldatetime" section), where it was used to add the RLM (right-to-left mark) in front of a Hijri date. Adding the RLM ensures that the date is formatted in the expected manner.

Locale Support in SQL Server 2005

SQL Server 2000 included specific locale support for 33 different languages. In SQL Server 2005, support was added for all languages that are supported by Microsoft Windows operating systems. Appendix B contains a list of all the supported languages, together with their locale IDs.

You can enumerate these languages and information about them by using the sp_helplanguage stored procedure. Note that although every version of SQL Server will store full information about many of the items listed below, you will not get translated system messages for all locales unless you have installed a localized version of the product.

The information on languages is stored in the syslanguages table, except for messages, which are stored in sysmessages.

Language Settings

Every instance of SQL Server must have a default language that it uses to handle items such as date formats and messages. This information is stored for each login to the server that is created. Although definition of a default language for the server is initially done during setup, it can be overridden at the server level in the Advanced tab in the SQL Server Properties dialog box, as shown in Figure 12. (In SQL Server 2000, this option is on the Server Settings tab.)

Figure 12. Setting the default language for the server

You can also use the sp_configure stored procedure to change the default language. For example, the following statements change the default language to Italian (language ID 6):

EXEC sp_configure 'default language', 6;
RECONFIGURE;

The default language setting can be overridden on a per-login basis by using the sp_addlogin stored procedure or by using the Login Properties dialog box shown in Figure 13.

Figure 13. Setting the default language for a login
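
For example, the following call (the login name and password are illustrative) creates a login whose default language is French:

EXEC sp_addlogin
    @loginame = 'intl_user',
    @passwd = 'P@ssw0rd1',
    @deflanguage = 'French';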

The default language can also be overridden at the session level by using the SET LANGUAGE statement.
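
For example, the following statements, a minimal sketch, switch the session to Italian, show the effect on localized month names, and then switch back:

SET LANGUAGE Italian;
SELECT @@LANGUAGE AS CurrentLanguage,
       DATENAME(month, GETDATE()) AS LocalizedMonthName;
SET LANGUAGE us_english;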

Working with Strings

Individual data-access technologies provide their own ways of specifying the language setting outside of a SET LANGUAGE call:

  • ADO supports a provider-specific language keyword in the ConnectionString.
  • OLE DB can set the provider-specific SSPROP_INIT_CURRENTLANGUAGE property.
  • ODBC can specify a language in the data source definition or in a LANGUAGE keyword in the connect string.
  • DB-Library can use dblogin to allocate a LOGINREC, and then DBSETNATLANG to specify a language setting.

The language settings (whether specified at the server, login, or session level) affect these items:

  • Messages
  • Date/Time
  • First day of week
  • Currency and currency symbols
  • Month/day names and abbreviated month names

Messages

SQL Server 2000 and SQL Server 2005 support multiple, language-specific copies of system error strings and messages. These messages are stored in the sysmessages table of the master database. When you install a localized version of SQL Server, the system messages are translated for that language version. By default, you also have the U.S. English set of these messages. Use SET LANGUAGE to specify the language of the server session. By default, messages are sent in the language of the installed version.

When SQL Server sends a message to a connection, it uses the localized message if the language ID set for the session matches one of the language IDs in the msglangid column of the sysmessages table. These IDs are in decimal format and represent the locale ID (LCID) of the message. If there is no message in the sysmessages table with that LCID, the U.S. English messages are sent.

You can add a language-specific user-defined message to the sysmessages table by using the @lang parameter of the sp_addmessage system stored procedure. The error number must be greater than 50,000, and the @lang parameter must specify the language, or a named alias that maps to the LCID, for the message. Because multiple language-specific system error strings and messages can be installed on the server, this value specifies the language in which the message is written to the sysmessages table. When @lang is omitted, the default language of the server session is used. The supported language definitions are stored in master.dbo.syslanguages.
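
For example, the following calls (the message number and text are illustrative) add a user-defined message in U.S. English and then a French translation; the U.S. English version of a message must exist before a localized version can be added:

EXEC sp_addmessage @msgnum = 55001, @severity = 16,
    @msgtext = N'The order number is not valid.', @lang = 'us_english';
EXEC sp_addmessage @msgnum = 55001, @severity = 16,
    @msgtext = N'Le numéro de commande n''est pas valide.', @lang = 'French';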

If you must install multiple language versions of sysmessages, see the Microsoft Help and Support site for additional steps and requirements. Installations of different language versions of SQL Server instances on the same computer are not supported. For example, German SQL Server 2005 and Portuguese (Brazil) SQL Server 2005 are not supported as multiple instances on the same computer.

Note   Many of the system tables from earlier releases of SQL Server are implemented as a set of views in SQL Server 2005. These views are known as compatibility views, and they are meant for backward compatibility only. The compatibility views expose the same metadata that was available in SQL Server 2000. However, they do not expose any of the metadata related to features introduced in SQL Server 2005.

Date/Time

Each language has an appropriate default for the short date format: mdy, dmy, or ymd. You can override this at the connection level by using the SET DATEFORMAT setting. To retrieve the default short date, use the sp_helplanguage stored procedure. The value is stored in the dateformat column.
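
For example, the following statements, a minimal illustration, show how the same calendar date must be written differently under two different settings:

SET DATEFORMAT dmy;
SELECT CAST('25/12/2006' AS datetime);  -- interpreted as December 25, 2006

SET DATEFORMAT mdy;
SELECT CAST('12/25/2006' AS datetime);  -- also December 25, 2006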

First Day of Week

The first day of the week varies among different locales; among the 33 languages in the syslanguages table, it varies between 1 (Monday) and 7 (Sunday). This information can be retrieved by using the sp_helplanguage stored procedure. The value is stored in the datefirst column.
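
The session value can also be inspected or overridden directly, as in this brief sketch:

SELECT @@DATEFIRST;  -- 7 (Sunday) for a us_english session
SET DATEFIRST 1;     -- make Monday the first day of the week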

Currency and Currency Symbols

Any column of the money or smallmoney type can include a currency symbol. The symbol does not have to be the one specified in the Regional Options dialog box; it can be any of the currency symbols that SQL Server recognizes, such as the dollar sign ($), the pound sign (£), the yen sign (¥), and the euro sign (€).

Note that the euro sign € (hexadecimal value 20AC) is not the same as ₠, the euro-currency (ECU) sign (hexadecimal value 20A0). If you attempt to use ₠ in a money value, an error will occur.
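
For example, the following conversions, a minimal illustration, succeed for recognized currency symbols, while the ECU sign raises a conversion error:

SELECT CAST(N'$123.45' AS money);    -- succeeds
SELECT CAST(N'€123.45' AS money);    -- succeeds
-- SELECT CAST(N'₠123.45' AS money); -- fails: the ECU sign is not a valid currency symbol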

Month/Day Names and Abbreviated Month Names

The names of the months and days are stored in the syslanguages table. You can retrieve them by using the sp_helplanguage stored procedure and examining the following columns:

months

A comma-delimited list of month names, January through December

shortmonths

A comma-delimited list of abbreviated month names, January through December

days

A comma-delimited list of the days of the week, Monday through Sunday

Dealing with COM's Locale Interference

Although SQL Server has some very powerful features when it comes to handling date/time and currency values, if you use any COM service such as ADO to access the server, you must make allowances for the operation of this service.

For example, you might have problems getting Visual Basic to recognize that a number value prefaced by any of the currency symbols shown in the preceding table is a currency value. You might also have problems getting COM to properly use date/time values stored in strings.

To properly troubleshoot these problems, you must understand exactly when your application is converting a string to a date/time or currency value. First, determine if the conversion occurs on the client or the server, and then you can decide which rules apply.

The Office Access 2000 ADP is an example of a client that converts date, time, or currency values. Because Access is working through OLE DB, all operations from Office Access 2000 are governed by the COM rules that use the client computer's regional settings.

OLE DB providers such as the Microsoft OLE DB Provider for SQL Server will properly convert a valid COM date to and from a SQL Server date. It is best not to rely on date formats in strings when you can avoid it. This type of functionality can break as the locale for the client changes.

SQL Server 2005 Integration Services

SQL Server 2005 Integration Services replaces Data Transformation Services in SQL Server 2000. Integration Services provides a greatly improved ability to import, export, and transform multilingual data. It supports the parsing and manipulation of multilingual data, supports all Windows locales, and provides special comparison options for sorting and comparing string data.

Locale Settings in Integration Services Packages

Integration Services supports locales at the level of the package object, container, task, and data flow component. You can also set the locale of event handlers, and of some sources and destinations.

A package can use multiple locales. For example, a package may use the English (United States) locale while one task in the package uses the German (Germany) locale and another task uses the Japanese (Japan) locale.

You can use any Windows locale in an Integration Services package. You set the locale when you construct the package. Unless the package uses configurations to update locale properties, the package is guaranteed to behave the same when deployed to computers whose regional and language options differ from those of the development environment.

Setting Locale

Integration Services does not use the code page to infer locale-specific rules for sorting data or interpreting date, time, and decimal data. Instead, the transformation reads the locale that is set by the LocaleID property on the data-flow component, Data Flow task, container, or package.

By default, the locale of a transformation is inherited from its Data Flow task, which in turn inherits from the package. If the Data Flow task is in a container such as the For Loop container, it inherits its locale from the container.

You can also specify a locale for a Flat File connection manager and a Multiple Flat Files connection manager.

You can change the locale or the code page of an object by using the editor for the data flow object or control flow object. For the package object, you must set properties such as LocaleID in the Properties window.

Figure 14. Specifying the locale for a package

Note that the CodePage property is not available for the package object. This is because the code page is not relevant for the package object, but only for sources, destinations, or transformations.

Working with Flat File Data

When working with flat file source data, it is important to understand how the Flat File connection manager interprets the flat file data. If the flat file source is Unicode, the Flat File connection manager defines all columns as [DT_WSTR] with a default column width of 50. If the flat file source is ANSI-encoded, the columns are defined as [DT_STR] with a column width of 50. You will probably have to change these defaults to make the string column types more appropriate for your data. To do this, look at the data type of the destination to which the data will be written. Then, choose the correct type in the Flat File connection manager.

Be aware that when importing data from or exporting data to flat files, certain properties of the package, the connection, and the data transformation tasks are set automatically, based on the locale. You can easily override these settings, and may need to do so if your data contains information that may be affected by locale settings. Failure to change the locale for the flat file source or destination may have unintended consequences.

For example, the tutorial for Integration Services that was released with SQL Server 2005 contains a scenario for importing data from a flat file. The file contains a BirthDate column that uses one of the default date formats for the EN-US locale. If you run this tutorial on a computer that uses a different default locale, the package that imports data may fail to run because it cannot parse the date data. In this case, to make the package work, all you have to do is edit the package, the data flow objects, and the related sources or destinations, to ensure that they all use the correct locale.

Figure 15. Specifying the locale and code page for a flat file

Parsing Text Data

When Integration Services loads text from a flat file or other source, it parses the text according to a set of predefined parsing routines: either fast parse or standard parse.

Fast parse is a fast, simple set of parsing routines that supports only the most frequently used date and time formats. It does not perform locale-specific parsing or data type conversions, does not recognize special characters in currency data, and cannot convert hexadecimal or scientific representations of integers.

Standard parse is a rich set of parsing routines that support all the data type conversions that are provided by the Automation data-type conversion APIs available in Oleaut32.dll and Ole2dsip.dll.

When you create a data flow, you can specify whether the transformation output columns use the quicker, but locale-insensitive, fast-parsing routines that Microsoft SQL Server 2005 Integration Services (SSIS) provides or the standard locale-sensitive parsing routines.

If the data flow in the package requires locale-sensitive parsing, standard parse is recommended instead of fast parse. For example, fast parse does not recognize locale-sensitive data that includes decimal symbols such as the comma, date formats other than year-month-date formats, and currency symbols.

For more information about the different data formats that are supported by fast parse and standard parse, see Parsing Data in SQL Server 2005 Books Online.

Fast parse is specified at the column level. In the Flat File source and the Data Conversion transformation, you can specify fast parse on output columns. Inputs and outputs can include both locale-sensitive and locale-insensitive columns. However, fast parse is available only when using the Flat File source or the Data Conversion transformation.

Comparison Options

The locale provides the basic rules for comparing string data in a data flow. For example, the locale specifies the sort position of each letter in the alphabet. However, these rules may not be sufficient for the comparisons that you want to perform, and Integration Services supports a set of advanced comparison options that go beyond the comparison rules of a locale. For example, if you choose to ignore nonspacing characters, "a" and "á" are equivalent for comparison purposes.

String comparisons are an important part of many of the transformations performed by Integration Services, and string comparisons are also used in the evaluation of expressions in variables and property expressions. For example:

  • The Conditional Split transformation can use string comparisons in expressions to determine to which output to send the data row.
  • The Derived Column transformation can use string comparisons in expressions to generate new column values.
  • The Sort, Aggregate, Fuzzy Grouping, and Fuzzy Lookup transformations can be customized to change the way strings are compared at the column level. You might specify that a lookup ignore case but treat accented and unaccented characters as different.

Depending on the data and the configuration of the transformation, the following processing may occur during the comparison of string data:

  • Converting data to Unicode. If the source data is not already Unicode, the data is automatically converted to Unicode before the comparison occurs.
  • Using locale to apply locale-specific rules for interpreting date, time, decimal data, and sort order.
  • Applying comparison options at the column level to change the sensitivity of comparisons.

Changing Comparison Options

Integration Services supports a set of advanced comparison options that can be set at the column level. For example, one of the comparison options lets you ignore nonspacing characters. The effect of this option is to ignore diacritics such as the accent, which makes "a" and "á" identical for comparison purposes.

You can set the following comparison options in the Sort, Aggregate, Fuzzy Grouping, and Fuzzy Lookup transformations:

  • Ignore case
  • Ignore kana type
  • Ignore character width
  • Ignore nonspacing characters
  • Ignore symbols
  • Sort punctuation as symbols

The Fuzzy Grouping and Fuzzy Lookup transformations also include the FullySensitive option for comparing data. This comparison flag, available only in the Advanced Editor, indicates that all the comparison options should apply.

Data Sources and Destinations

In most cases, Integration Services can identify the correct code page from the data source. If Integration Services provides an unexpected code page, or if the package accesses a data source by using a provider that does not supply sufficient information to determine the correct code page, you can specify a default code page in the OLE DB source and the OLE DB destination. The default code pages are used instead of the code pages that Integration Services provides.

Files do not have code pages. Instead, the Flat File and the Multiple Flat Files connection managers that a package uses to connect to file data include a property for specifying the code page of the file. The code page can be set at the file level only, not at the column level.

Depending on the operations that the transformation performs and the configuration of the transformation, string data may be converted to the DT_WSTR data type, which is a Unicode representation of string characters.

String data that has the DT_STR data type is converted to Unicode using the code page of the column. Integration Services supports code pages at the column level, and each column can be converted by using a different code page.

Explicit Conversion of String Data to Unicode

Data flows in packages extract and load data. Data flows can access data in heterogeneous data stores, which may use a variety of standard and custom data types. Within a data flow, Integration Services sources do the work of extracting data, parsing string data, and converting data to an Integration Services data type. Subsequent transformations may parse data to convert it to a different data type, or create column copies with different data types. Expressions used in components may also cast arguments and operands to different data types. Finally, when the data is loaded into a data store, the destination may parse the data to convert it to a data type that the destination uses.

When data enters a data flow in a package, the source that extracts the data converts the data to an Integration Services data type. Numeric data is assigned a numeric data type, string data is assigned a character data type, and dates are assigned a date data type. Other data, such as GUIDs and binary large object blocks (BLOBs), are also assigned appropriate Integration Services data types. If data has a data type that is not convertible to an Integration Services data type, an error occurs.

For more information about data conversion in data flows, see Integration Services Data Types in SQL Server 2005 Books Online.

The Data Conversion transformation provides you with the ability to control conversions—to explicitly change code pages, change format, or convert the data in an input column to a different data type. You can apply multiple conversions to a single input column. The converted data can replace the original data, or you can copy it to a new output column for use in downstream components.

For example, a package can extract data from sources in multiple languages, and then use this transformation to convert columns to the data type required by the destination data store.

Using Data Conversion transformation, you can perform the following types of data conversions:

  • Change the data type.
  • Set the column length of string data and the precision and scale on numeric data.
  • Specify a code page.

The Character Map transformation is another component that is useful for performing conversions on character data. It applies string functions, such as conversion from lowercase to uppercase, to data that has a string data type.

The Character Map transformation can convert column data in place or add a column to the transformation output and put the converted data in the new column. You can apply different sets of mapping operations to the same input column and put the results in different columns. For example, you can convert the same column to uppercase and lowercase and put the results in two different columns.

The Character Map transformation supports the following mapping operations:

Operation Description
Byte reversal Reverses byte order.
Full width Maps half-width characters to full-width characters.
Half width Maps full-width characters to half-width characters.
Hiragana Maps katakana characters to hiragana characters.
Katakana Maps hiragana characters to katakana characters.
Linguistic casing Applies linguistic casing instead of the system rules. Linguistic casing refers to functionality provided by the Win32 API for Unicode simple case mapping of Turkic and other locales.
Lowercase Converts characters to lowercase.
Simplified Chinese Maps traditional Chinese characters to simplified Chinese characters.
Traditional Chinese Maps simplified Chinese characters to traditional Chinese characters.
Uppercase Converts characters to uppercase.

More than one operation can be performed in a transformation. However, some mapping operations are mutually exclusive. In most cases these exclusions are common sense: for example, you cannot map katakana to hiragana at the same time that you map hiragana to katakana. For more information about the mappings that are exclusive, see Character Map Transformation in SQL Server 2005 Books Online.

Note Mapping can, under some circumstances, cause data to be truncated. For example, truncation can occur when single-byte characters are mapped to characters with a multibyte representation. When you perform data conversions, use data viewers to monitor the transformations in the data, and use error outputs to direct truncated or error data to separate output. For more information, see Handling Errors in Data in SQL Server 2005 Books Online.

Support for Supplementary Characters in SSIS

Integration Services handles surrogate pairs transparently. That is, unless a data source, data destination, or transformation changes the size of a string, surrogate-pair characters are handled the same way as any other Unicode characters. Functions or features that return the size of a string, such as the LEN and SUBSTRING string functions, treat a surrogate pair as two separate, undefined Unicode characters.

To display supplementary characters, the user must customize Business Intelligence Development Studio to use a font that supports the extensions for these characters. However, because supplementary characters are only supported by SQL Server as data and not metadata, the places where you might need to display these characters are limited to data previews, the WHERE clause of queries, and so forth.

Dynamically Changing Locale with Configurations

If a package must use different locales when deployed to different servers, you can create configurations that provide the updated locales to use when the package is run.

Configurations are attribute-value pairs that let you set run-time properties and variables from outside of the development environment. By incorporating configurations into development, you can create packages that are flexible and easy to both deploy and distribute. Integration Services offers the following configuration types:

  • XML configuration file
  • Environment variable
  • Registry entry
  • Parent package variable
  • SQL Server table

To provide additional flexibility, Integration Services supports the use of indirect configurations. This means that you use an environment variable to specify the location of the configuration, which in turn specifies the actual values.

An XML configuration file can include configurations for multiple properties, and under some conditions can store configurations for multiple packages.

Several new Integration Services tutorials have been released that walk you through the process of creating and testing configurations, and then building a deployment utility to create new packages that contain the dynamic configurations. We recommend that you try these tutorials to learn more about how you can dynamically update code pages, conversion options, or other international settings by using configurations and then deploying packages.

Creating a Simple ETL Package Tutorial

The Creating a Simple ETL Package Tutorial steps you through the following tasks:

  • Creating a simple Integration Services package and then adding looping.
  • Introducing the Package Configuration Wizard and the format of XML configurations.
  • Creating a package configuration that changes the directory that contains the text files to loop through.
  • Testing the configuration by modifying the value of the variable from outside the development environment and pointing the modified property to a new sample data folder. When you run the package again, the configuration file updates the file directory.

Deploying Packages Tutorial

You will find the Deploying Packages Tutorial useful for learning the following tasks:

  • Using a combination of XML configuration files and indirect configurations to update the connection strings of log files and text files, and the locations of the XML and XSD files that the package uses at run time.
  • Creating a deployment bundle to use to install the packages. The deployment bundle consists of the package files and other items that you added to the Integration Services project, the package dependencies that Integration Services automatically includes, and the deployment utility that you built.
  • Running the Package Installation Wizard to install the packages and package dependencies. The packages are installed in the msdb SQL Server database, and the supporting files are installed in the file system.
  • Updating the configuration to use new values that enable packages to run successfully in the new environment.

Using Command-Line Utilities with Multilingual Data

This section provides basic information about some of the command-line utilities provided with SQL Server 2005. The bcp utility has been updated to provide features useful in multilingual scenarios, and includes support for XML formats. sqlcmd is a new utility that simplifies scripting and allows you to control the code page of script files.

bcp

When you want to import data from or export data to a particular code page, you can still use the bcp utility. The bcp utility bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format.

In SQL Server 2005, bcp enforces new data validation and data checks that might cause existing scripts to fail when they are executed on invalid data in a data file. For example, bcp now verifies:

  • Whether the native representation of float or real data types is valid.
  • Whether Unicode data has an even byte length.

Note Forms of invalid data that could be bulk imported in previous versions of SQL Server might fail to load now. In previous releases, the failure did not occur until a client tried to access the invalid data. The reason for the added validation is that validating before the data is loaded minimizes surprises when you query the data after bulk load.

The bcp utility supports the following flags for lossless conversion of data:

Flag Meaning Explanation and notes
-C xxx Code page specifier xxx can specify a code page, ANSI, OEM, or RAW. This option is supported for compatibility with earlier versions of SQL Server. For SQL Server 7.0 and later, Microsoft recommends that you specify a collation name for each column in a format file.

ANSI: Microsoft Windows (ISO 1252).

OEM: Default code page used by the client. This is the default code page used if -C is not specified.

RAW: No conversion from one code page to another occurs. This is the fastest option.

code_page: Use only if the data contains char, varchar, or text columns with character values greater than 127 or less than 32.

-N Use Unicode native format Uses native (database) data types for all noncharacter data, and Unicode character format for all character data.

This option offers a higher performance alternative to the -w option, and is intended for transferring data from one instance of SQL Server to another using a data file. It does not prompt for each field. Use this option when you are transferring data that contains ANSI extended characters and you want to take advantage of the performance of native mode. Cannot be used with SQL Server 6.5 or earlier versions.

-w Use Unicode character format Uses the Unicode character data format for all columns.

This option does not prompt for each field; it uses nchar as the storage type, no prefixes, \t (tab character) as the field separator, and \n (newline character) as the row terminator. Cannot be used with SQL Server 6.5 or earlier versions.
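
For example, the following command (the server and file names are illustrative) exports a table in Unicode character format over a trusted connection:

bcp AdventureWorks.Person.Contact out Contact.dat -w -T -S myserver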

You can also use format files and specify collations at the column level. If you do not specify -C, -N, or -w, bcp will ask for the collation and code page information for each column before performing the import or export. You are then prompted to save the information to a format file, as shown here:

9.0
9
1   SQLCHAR   0   11   ""   1   au_id   Latin1_General_CI_AI
2   SQLCHAR   0   40   ""   2   au_lname   Latin1_General_CI_AI
3   SQLCHAR   0   20   ""   3   au_fname   Latin1_General_CI_AI
4   SQLCHAR   0   12   ""   4   phone   Latin1_General_CI_AI
5   SQLCHAR   0   40   ""   5   address   Latin1_General_CI_AI
6   SQLCHAR   0   20   ""   6   city   Latin1_General_CI_AI
7   SQLCHAR   0   2   ""   7   state   Latin1_General_CI_AI
8   SQLCHAR   0   5   ""   8   zip   Latin1_General_CI_AI
9   SQLBIT   0   1   ""   9   contract   ""

For information about format files, see Understanding Non-XML Format Files in SQL Server 2005 Books Online.

SQL Server 2005 introduces an XML format file for bcp. XML format files have many advantages: They contain descriptions of the corresponding table columns, and they contain the data types of target columns. XML files are easy to read, create, and extend. You can create an XML format file by using bcp with the -x option. After the format file has been created, you can use it to import or export data with bcp and the -f option.
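
For example, the following commands (file names are illustrative) first generate an XML format file for a table in character mode, and then use that file to export the data:

bcp AdventureWorks.Person.Contact format nul -c -x -f Contact.xml -T
bcp AdventureWorks.Person.Contact out Contact.dat -f Contact.xml -T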

The following table lists the data types that can be imported or exported by using bcp. The collation is the default one specified for the column, which may have been inherited from the database or the server.

Type Full name
C char
T text
I int
S smallint
T tinyint
F float
M money
B bit
D datetime
X binary
I image
D smalldatetime
r real
M smallmoney
n numeric
E decimal
W nchar
W ntext
U uniqueidentifier

Note that the varchar and nvarchar data types are not listed in the table. For the bcp utility, the char and nchar types should be used in their place, respectively.

Finally, bcp supports the -R flag for "Regional Enable." This flag has the same effect as the ODBC Use regional settings option, which can affect the interpretation of date/time, number, and currency data that is stored in nontext fields.

sqlcmd

The sqlcmd utility is new in SQL Server 2005, and provides functionality formerly provided by a combination of osql and isql, batch files, and VBScript. This utility uses OLE DB to execute Transact-SQL batches.

You can run ad-hoc queries interactively from a command prompt window, or you can execute a script that contains Transact-SQL statements or system procedures. You can create script files and then run the scripts at the command prompt, in Query Editor in SQLCMD mode, in a Windows script file or in an operating system (Cmd.exe) job step of a SQL Server Agent job.

You can use the -f option to specify the input code page and the output code page.

The code page number must be a numeric value that specifies an installed Windows code page. If no code page is specified, sqlcmd uses the code page of the current system for both input and output files. However, you can use the -u option to direct that the output file be stored in Unicode format, regardless of the format of the input file.

The following examples demonstrate this syntax. The MyScript.sql script file contains the following statements:

USE AdventureWorks;
GO
SELECT * FROM Production.vProductAndDescription
ORDER BY CultureID;
GO

The first example runs the query and saves the output to a file in Unicode. The second example runs the same query but saves the output file in the code page that has the ID 950. An error can occur if the code page is not available.

sqlcmd -i C:\Temp\MyScript.sql -o C:\temp3\MyOutput1.txt -u
sqlcmd -i C:\Temp\MyScript.sql -f i:1252,o:950 -o C:\temp3\MyOutput2.txt

sqlcmd automatically recognizes both Big Endian and Little Endian Unicode input files. If the -u option has been specified, the output will always be Little Endian Unicode.

Multiple input files are assumed to be of the same code page. Unicode and non-Unicode input files can be mixed.

You can create and edit sqlcmd scripts by using the Query Editor in SQL Server Management Studio (SSMS). However, SSMS uses the Microsoft .NET SqlClient for execution, whereas sqlcmd uses the OLE DB provider when you run a script from the command line. Therefore, you might see different behavior when you execute the same query in SQL Server Management Studio in SQLCMD Mode and in the sqlcmd utility.

International Features in SQL Server Analysis Services

Microsoft SQL Server 2005 Analysis Services supports all languages that are supported by Microsoft Windows operating systems. Analysis Services uses Windows language identifiers to specify the selected language for Analysis Services instances and objects. A Windows language identifier corresponds to a combination of Windows primary language and sublanguage identifiers. For example, if you select English (United States) during Setup, the corresponding Windows language identifier, 0x0409 (or 1033), is specified in the Language element of the configuration settings file for the Analysis Services instance.

Translations

Besides specifying the default language and collation used by an Analysis Services instance, you can also provide multi-language support for individual Analysis Services objects, including cubes, measure groups, dimensions, hierarchies, and attributes, by defining a translation associated with an Analysis Services object.

You can use Business Intelligence Development Studio to define the translations for the caption, description, and account types for an Analysis Services database. When you use translations, the client applications can receive Analysis Services objects in a different language and collation.

The default language and collation settings for an Analysis Services instance specify the settings used for data and metadata if a translation for a specific language identifier is not provided for an Analysis Services object, or if a client application does not specify a language identifier when connecting to an Analysis Services instance.

Note that although translations provide display information for the names of Analysis Services objects, the identifiers are not translated. To ensure portability across multiple languages, use the identifiers and keys for Analysis Services objects instead of the translated captions and names. For example, use member keys instead of member names for Multidimensional Expressions (MDX) statements and scripts.

If a client application requests information in a specified language identifier, the Analysis Services instance attempts to resolve data and metadata for Analysis Services objects to the closest possible language identifier. If the client application does not specify a default language, or specifies the neutral locale identifier (0) or process default language identifier (1024), Analysis Services uses the default language for the instance to return data and metadata for Analysis Services objects.

If the client application specifies a language identifier other than the default language identifier, the instance iterates through all available translations for all available objects. If the specified language identifier matches the language identifier of a translation, Analysis Services returns that translation. If a match cannot be found, Analysis Services attempts to use one of the following methods to return translations with a language identifier closest to the specified language identifier.

For the following language identifiers, Analysis Services attempts to use an alternate language identifier if a translation for the specified language identifier is not defined.

Specified language identifier Alternate language identifier
3076: Chinese (Hong Kong SAR, PRC) 1028: Chinese (Taiwan)
5124: Chinese (Macao SAR) 1028: Chinese (Taiwan)
1028: Chinese (Taiwan) Default language
4100: Chinese (Singapore) 2052: Chinese (PRC)
2074: Serbian (Latin) Default language
3098: Serbian (Cyrillic) Default language

For all other specified language identifiers, Analysis Services extracts the primary language of the specified language identifier and retrieves the language identifier indicated by Windows as the best match for the primary language. If a translation for the best match language identifier cannot be found, or if the specified language identifier is the best match for the primary language, the default language is used.

Collations in Analysis Services

Analysis Services uses Windows collations to specify the selected collation for Analysis Services instances and objects. When you specify a Windows collation for Analysis Services, the Analysis Services instance uses the same code pages and sorting and comparison rules as an application that is running on a computer for which you have specified the associated Windows locale. For example, the French Windows collation for Analysis Services matches the collation attributes of the French locale for Windows.

There are more Windows locales than there are Windows collations defined for Analysis Services. The names of Windows locales are based on a language identifier, such as English, and a sublanguage identifier, such as United States or Australia. However, many languages share common alphabets and rules for sorting and comparing characters.

For example, 33 Windows locales, including all the Portuguese and English Windows locales, use the Latin1 code page (1252) and follow a common set of rules for sorting and comparing characters. The SQL Server Windows collation Latin1_General, based on this code page and its associated sorting rules, supports all 33 of these Windows locales. Also, Windows locales specify attributes that are not covered by Analysis Services Windows collations, such as currency, date, and time formats. Because countries and regions such as Australia and the United States have different currency, date, and time formats, they require different Windows locales. They do not require different Analysis Services Windows collations, however, because they share the same alphabet and the same rules for sorting and comparing characters.

Although multiple language identifiers can be specified for Analysis Services objects, the same Analysis Services Windows collation is used for all Analysis Services objects, with a single exception, regardless of language identifier. The single exception to this functionality is the CaptionColumn property of an attribute in a database dimension, for which you can specify an Analysis Services Windows collation to collate the members of the specified attribute.

Sort-Order Options in Analysis Services

If the same language is used by all the users for your Analysis Services instance, select the collation that supports the specified default language for your instance. If multiple languages are used, choose a collation that best supports the requirements of the various languages. For example, if the users of your instance generally speak western European languages, select the Latin1_General collation.

Several sort-order options can be applied to the specified Analysis Services Windows collation to further define sorting and comparison rules based on case, accent, kana, and width sensitivity. The sort-order options are the same as those for the SQL Server database engine: you can define case, width, accent, and kana-type sensitivity.

If you use the English (United States) language identifier (0x0409, or 1033) as the default language for the Analysis Services instance, you can get additional performance benefits by setting the EnableFast1033Locale configuration property. This is an advanced configuration property that is available only for that language identifier. Setting the value of this property to true enables Analysis Services to use a faster algorithm for string hashing and comparison.

Note   One feature that is not supported is the ability to use different collations in different parts of the hierarchical data structures that make up a data warehouse. Although this is not generally useful in the analysis of data, there are cases, such as partitioning data alphabetically, where it might be useful; it would also be useful in the display of hierarchical data. If you need such functionality, you must partition the data in a way that explicitly defines the ordering and the partitions without assuming letter order, or sort the relevant subset of the data after it is returned from the OLAP source.

Support for Multiple Currencies

Analysis Services provides support for currency conversion in cubes that contain multiple currencies. Before you can define a currency conversion in a cube, you must first define at least one currency dimension, at least one time dimension, and at least one rate measure group. From these objects, the Business Intelligence Wizard can retrieve the data and metadata used to construct the reporting currency dimension and create an MDX script to provide currency conversion functionality.

Before you can implement currency conversions, you must define the following:

Term Definition
Pivot currency The currency against which exchange rates are entered in the rate measure group.
Local currency The currency used to store transactions on which the convertible measures are based.

The local currency can be identified by either of the following attributes:

  • A currency identifier in the fact table stored with the transaction. This is commonly the case with banking applications, where the transaction itself identifies the currency used for that transaction.
  • A currency identifier associated with an attribute in a dimension table that is then associated with a transaction in the fact table. This is commonly the case in financial applications, where a location or other identifier, such as a subsidiary, identifies the currency used for an associated transaction.
Reporting currency The currency to which transactions are converted from the pivot currency.
Exchange rate direction Indicates whether the multiplier is applied to the pivot currency or the sample currency. The combination of exchange rate direction and conversion type determines the operation performed on measures to be converted.
Converted members Specifies the scope of the currency conversion calculation.
Conversion type Indicates how transactions are stored, and how many currencies to use in conversion.
  • One-to-many: Stores as pivot currency. Converts to multiple reporting currencies.
  • Many-to-one: Stores as local currency. Converts to pivot currency, which is used as sole reporting currency.
  • Many-to-many: Stores as local currency. Converts to pivot currency, and then to multiple reporting currencies.

You can then use the Business Intelligence Wizard to generate an MDX script that uses a combination of data and metadata from dimensions, attributes, and measure groups to convert measures containing currency data. For more information about how to use the wizard to create currency conversions, see Working with Currency Conversions (SSAS) in SQL Server 2005 Books Online.

Working with Date and Time Values in SSAS

When you perform month and day-of-week comparisons and operations, use the numeric date and time parts instead of date and time part strings. Date and time part strings are determined in part by the language identifier specified for the instance, and the current translation provided by the instance for the members of the time dimension.

You should take advantage of the various date and time functions in MDX for working with time dimensions. The Visual Basic for Applications (VBA) date and time functions are also very useful for returning numeric date and time parts instead of the name strings. For client applications, use the literal date and time part strings when returning results that will be displayed to a user, because the strings are usually more meaningful than a numeric representation. However, do not code any logic that depends on the displayed names being in a specific language.

XML Support in SQL Server 2005

SQL Server 2005 introduces SQLXML 4.0, which delivers the same basic functionality as SQLXML 3.0, plus additional updates to accommodate new features introduced in Microsoft SQL Server 2005.

SQL Server 2005 supports the xml data type, which lets you store XML documents and fragments in a SQL Server database. An XML fragment is an XML instance that is missing a single top-level element. You can create columns and variables of the xml type and store XML instances in them, provided the stored representation does not exceed 2 GB. The xml data type and associated methods help integrate XML into the relational framework of SQL Server.
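
For example, the following statements, a minimal sketch with an illustrative table name, declare an xml variable, assign a fragment to it, and create a table with an xml column:

DECLARE @x xml;
SET @x = N'<items><item id="1"/><item id="2"/></items>';

CREATE TABLE XmlDocs
(
    DocID int PRIMARY KEY,
    Doc   xml
);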

XML and Unicode

The SQL Server 2005 xml data type implements the ISO SQL-2003 standard xml data type. Therefore, it can store well-formed XML version 1.0 documents and also so-called XML content fragments with text nodes. The system checks that the data is well-formed, does not require that the column be bound to XML schemas, and rejects data that is not well-formed.

In SQL Server 2005, you can generate XML in various ways: you can type cast strings, use the SELECT statement with the FOR XML clause, use bulk load, or assign XML to a string constant. For information about how the string and binary types interact with the XML document encoding, and how the XML parser behaves, see Generating XML Instances in SQL Server 2005 Books Online.
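
For example, the following query against the AdventureWorks sample database uses the FOR XML clause with the TYPE directive so that the result is returned as an xml value rather than as text:

SELECT TOP (2) ProductID, Name
FROM Production.Product
FOR XML AUTO, TYPE;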

XML documents can be encoded with different encodings; for example, UTF-8, UTF-16, or Latin-1252. There are several ways you can specify an encoding in XML:

  • If you format data as XML in an ADO Stream object and then persist the stream, you can specify an output encoding and the proper encoding will be marked in the XML-formatted data.
  • You can specify an output encoding in a URL.
  • XML templates can specify an encoding.

Even if you do not use any of these methods, by default Unicode is supported and will work properly.

Note   The SQL Server 2005 xml data type handles white space differently from the description of white space in the XML 1.0 specification. In SQL Server, white space inside element content is considered insignificant if it occurs inside a sequence of white-space-only character data delimited by markup, such as begin or end tags, and is not entitized. (CDATA sections are ignored.) This is because the XML parser in SQL Server recognizes only a limited number of DTD subsets, as defined in XML 1.0. For more information about the limited DTD subsets supported in SQL Server 2005, see CAST and CONVERT (Transact-SQL) in SQL Server 2005 Books Online.

Updategrams

You can modify (insert, update, or delete) data in a Microsoft SQL Server 2005 database from an existing XML document by using an updategram or the OPENXML Transact-SQL function. Updategrams are designed to work well with multilingual text and support methods for specifying encoding.

The OPENXML function modifies a database by shredding the existing XML document and providing a rowset that can be passed to an INSERT, UPDATE, or DELETE statement. With OPENXML, operations are performed directly against the database tables. Therefore, OPENXML is most appropriate wherever rowset providers, such as a table, can appear as a source.
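
The following is a minimal sketch of the OPENXML pattern (the element and column names are illustrative):

DECLARE @doc nvarchar(max), @hDoc int;
SET @doc = N'<ROOT><Customer CustomerID="1" Name="Contoso"/></ROOT>';

-- Parse the document and obtain a handle to its in-memory representation.
EXEC sp_xml_preparedocument @hDoc OUTPUT, @doc;

-- Shred the document into a rowset; this rowset could feed an INSERT statement.
SELECT CustomerID, Name
FROM OPENXML(@hDoc, '/ROOT/Customer', 1)
     WITH (CustomerID int, Name nvarchar(50));

-- Release the memory used by the parsed document.
EXEC sp_xml_removedocument @hDoc;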

Like OPENXML, an updategram allows you to insert, update, or delete data in the database; however, an updategram works against the XML views provided by an annotated XSD (or XDR) schema. That is, the updates are applied to the XML view provided by the mapping schema. The mapping schema, in turn, has the necessary information to map XML elements and attributes to the corresponding database tables and columns. The updategram uses this mapping information to update the database tables and columns.

Data Access for XML Clients

To use SQLXML 4.0, you must also have the Microsoft SQL Native Client and MDAC 2.6 or later. Unlike Microsoft Data Access Components (MDAC), which is now a component of the operating system, SQL Native Client is a component of SQL Server 2005. Therefore, you must install SQL Native Client in your development environment and redistribute it with any application that depends on it.

In previous versions of SQLXML, HTTP-based query execution was supported using SQLXML IIS virtual directories and the SQLXML ISAPI filter. In SQLXML 4.0, these components have been removed as similar and overlapping functionality is now provided with native XML Web services in SQL Server 2005. If your solution requires the enhanced data typing features of SQL Server 2005 such as the xml data type or user-defined data types (UDTs) and Web-based access, you will need to use another solution, such as the SQLXML managed classes, or another type of HTTP handler, such as Native XML Web Services for SQL Server 2005.

If you do not require these type extensions of SQL Server 2005, you can continue to use SQLXML 3.0 to connect to SQL Server 2005 installations. The SQLXML 3.0 ISAPI support will work against SQL Server 2005 but does not support or recognize the xml data type or UDT type support introduced in SQL Server 2005.

For an introduction to the complete set of XML features in SQL Server 2005, see Using XML in SQL Server and XML Best Practices in SQL Server 2005 Books Online.

For information about how to work with XML in client applications, see Working with the xml Data Type in Applications and SQLXML 4.0 Programming in SQL Server 2005 Books Online.

Enhancements to Full-Text Search

SQL Server 2005 Full-Text Search includes a major upgrade of the Microsoft Search service (MSSearch) to version 3.0. These improvements include the following features:

  • Greatly improved full-text index population performance
  • One instance of MSSearch 3.0 per instance of SQL Server, allowing side-by-side install of the full-text engine on separate instances
  • Query enhancements, including the ability to specify language
  • Support for full-text indexing and search of xml data type

Multiple Instances of the Full-Text Engine

The Microsoft Full-Text Engine for SQL Server (MSFTESQL) service is based on the Microsoft Search (MSSearch) service. In SQL Server 2005, a separate service is installed per instance of Microsoft SQL Server. This means that SQL Server no longer has to share the MSSearch service with other server products that use the MSSearch service. Having separate instances of the full-text engine also simplifies the process of managing and updating servers. You can install other products that use the MSSearch service, such as Microsoft SharePoint Portal Server, without worrying that a newer version of the MSSearch service will affect the behavior of full-text search.

Upgrading to the new version of MSSearch is done during setup of SQL Server 2005. An instance of SQL Server 2005 is set up side-by-side with the old version of SQL Server, and data is migrated. If the old version of SQL Server had Full-Text Search installed, a new version of Full-Text Search is installed automatically.

Format Changes in Full-Text Catalogs

The format of full-text catalogs has changed significantly in SQL Server 2005. This change requires that all full-text catalogs from previous versions of SQL Server must be rebuilt. Rebuilding occurs automatically and does not prevent upgrade from completing. As a result, the rebuilding of full-text catalogs might continue after upgrade.

When the full-text catalog is rebuilt, a new folder is created under the path where the full-text catalog was originally located. The files associated with the rebuilt full-text catalogs are listed in this new folder.

Language-Specific Options for Full-Text Searching

In Microsoft SQL Server 2005, full-text queries can use languages other than the default language for the column to search full-text data. As long as the language is supported and its resources are installed, the language specified in the LANGUAGE language_term clause of the CONTAINS, CONTAINSTABLE, FREETEXT, and FREETEXTTABLE queries will be used for word breaking, stemming, and thesaurus and noise-word processing.

When you create a full-text index on a column, you must choose the language for the column. By choosing a language, you define options for how text is tokenized and then indexed by the full-text engine. These options include the following:

  • Use of a language-specific word breaker, or the neutral word breaker
  • Language-based stemming

Appendix E contains a list of languages that support language-specific stemming or word-breaking. All other languages use the language neutral engine. In this case, word breaking uses white space for separation, and no stemming is performed.

Word breakers are designed mainly to process plain text; therefore, if your text contains any type of markup (such as HTML), you may not get good linguistic accuracy during indexing and search. In that case, you have two choices. The preferred method is to store the text data in a varbinary(max) column and indicate its document type, so that it can be filtered.

If you cannot store your data in a varbinary(max) column, consider using the neutral word breaker. If possible, you can also add markup tokens (such as the 'br' tag in HTML) to your noise-word lists. When data is not stored in a varbinary(max) column, no special filtering is performed; instead, the text is passed to the word-breaking component as-is.

The Unicode collation locale identifier setting is used against all data types eligible for full-text indexing (such as char, nchar, and so on). If you have specified the sort order of a char, varchar, or text column using a language setting different from the Unicode collation locale identifier language, the Unicode collation locale identifier is still used during full-text indexing and querying of the char, varchar, and text columns.

Query Enhancements for Full-Text Searching

SQL Server 2005 full-text query syntax for the CONTAINS, CONTAINSTABLE, FREETEXT, and FREETEXTTABLE clauses all support a new LANGUAGE <lcid> parameter that can be used to override the column-default full-text language with a clause-level language. This clause-level language dictates which word breaker, stemmer, thesaurus, and noise word list to use for all terms of the full-text query. The LCID is specified either as a numeric value or as a string. The LANGUAGE <lcid> parameter's value can also be specified as a Transact-SQL variable if it is used in a full-text query clause in a stored procedure.

For example, you can pass in the search string and the query-level language value as a parameter to a stored procedure that obtains the search string from a text input field, and the query language from a list of supported languages.
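
The following query is a minimal sketch of this syntax; it assumes that a full-text index exists on the DocumentSummary column of the Production.Document table in the AdventureWorks sample database:

SELECT Title
FROM Production.Document
WHERE CONTAINS(DocumentSummary, N'reflector', LANGUAGE N'English');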

For sample code that demonstrates the use of the LCID parameter, see SQL Server 2005 Full-Text Search: Internals and Enhancements in the MSDN Library. For more information about how to use the LCID parameter in full-text queries, see FREETEXT (Transact-SQL), FREETEXTTABLE (Transact-SQL), CONTAINS (Transact-SQL), or CONTAINSTABLE (Transact-SQL) in SQL Server 2005 Books Online.

In past releases, mapping of DocIds to full-text key values was performed in the full-text engine. In SQL Server 2005, this process has been moved into the database engine where more efficient and consistent caching strategies may be utilized. This migration, in addition to enhancements to the query engine, should speed up full-text queries over previous releases.

Full-Text Search of XML Data

SQL Server 2005 introduces a new xml data type that allows you to store an XML fragment or document. Full-text search in SQL Server now supports the creation of full-text indexes on, and full-text queries against, data stored in the xml data type.

Queries are at the granularity of the column value. Full-text predicates issued against a full-text indexed XML column return rows where the specified search string exists anywhere in the content of the column. For more information, see Querying varbinary(max) and xml Columns in SQL Server 2005 Books Online.

Interacting with Other Database Products

When working with other database systems, the most important task for international applications is to determine the code page and rules of that system. Many of the data-access methods for SQL Server use COM, and thus use Unicode data. SQL Native Client also uses Unicode. Therefore, how well the other database products support Unicode is the main piece of information you need to determine how well an international application will run between them.

For example, other database products such as Oracle and Sybase SQL Server support Unicode using UTF-8 encoding. Usually this will not affect you, because the data is converted to UTF-16 by ADO/OLE DB before you ever see it. But be aware of the difference if you try to interact directly with data in such products.

Some products support the National character data types but do not treat them as Unicode data types. In such databases, nchar and nvarchar are simply fields that you could use, for example, to store Japanese data in an otherwise U.S. English database. The use of the National character data types to specify Unicode text is specific to Microsoft SQL Server.

Therefore, if you use commands such as OPENROWSET, run queries against another product, or import data from other systems, it is very important that you know how international text information is being handled in the other database. For more information, see Connecting to the SQL Server Database Engine and Distributed Queries in SQL Server 2005 Books Online.
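For example, an ad hoc distributed query along the following lines reads rows from an Oracle server through the Oracle OLE DB provider. The server name, credentials, and remote query are placeholders, and running it requires that ad hoc distributed queries be enabled on the server.

-- Hypothetical ad hoc query against an Oracle data source. How the
-- returned text is converted to Unicode depends on the provider.
SELECT *
FROM OPENROWSET('OraOLEDB.Oracle',
                'OracleServer'; 'scott'; 'tiger',
                'SELECT ename FROM emp') AS remote_emp;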

For information about connecting to external data sources in SQL Server 2005 for application development, see the following topics in SQL Server 2005 Books Online:

SQL Server Analysis Services (SSAS): Working with Data Sources (Analysis Services)
SQL Server Integration Services (SSIS): Integration Services Connections
SQL Server Reporting Services: Data Sources Supported by Reporting Services
SQL Server Replication: Heterogeneous Database Replication
JDBC driver: Overview of the JDBC Driver
CLR database objects: Data Access from CLR Database Objects

Conclusion

Microsoft SQL Server 2005 builds on the strong set of international features that were provided in SQL Server 2000, and adds improved, more consistent Unicode support as well as integration with the multilingual capabilities of the Windows operating system and the common language runtime (CLR). SQL Server 2005 combines a strong foundation for international database support with a wide range of multilingual development tools for rapid development and robust deployment of international solutions for finance, e-commerce, and global communication.

For more information:

SQL Server Developer Center

Acknowledgments

This paper was originally written by Michael Kaplan for SQL Server 2000, with the assistance of Michael Kung, Program Manager for SQL Server; Peter Carlin, Development Manager for the SQL Server relational engine; and Fernando Caro, lead international program manager. Additional information was provided by Michael Rys, Euan Garden, Fadi Fakhouri, James Howey, and Margaret Li.

Many of the hardworking people on the Windows 2000 team were also very helpful, both for providing information on how basic issues such as collation and locale support are supposed to work and, of course, for providing the original data on which the collation support in SQL Server 7.0 and SQL Server 2000 is based. I would especially like to thank Julie Bennett, Cathy Wissink, and John McConnell for being available to answer questions and for providing encouragement in getting the word out!

Updated information regarding the features in SQL Server 2005 was provided by the SQL Server User Education team, with the assistance of Fernando Caro and Nobuhiko Kishi.

Appendix A: Collation Suffixes

Suffix Meaning
_BIN binary sort
_BIN2 binary code-point sort (new in SQL Server 2005)
_CI_AI case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
_CI_AI_WS case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
_CI_AI_KS case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
_CI_AI_KS_WS case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
_CI_AS case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
_CI_AS_WS case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive
_CI_AS_KS case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive
_CI_AS_KS_WS case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive
_CS_AI case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive
_CS_AI_WS case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive
_CS_AI_KS case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive
_CS_AI_KS_WS case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive
_CS_AS case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive
_CS_AS_WS case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive
_CS_AS_KS case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive
_CS_AS_KS_WS case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive

Note   Kana sensitivity is set by default to insensitive. In other words, by default, katakana and hiragana are treated as the same. Width sensitivity is also insensitive by default. In other words, by default, full-width and half-width characters are treated as the same.

Appendix B: Locales Supported in Windows

[Image omitted: table of locales supported in Windows.]

Appendix C: SQL-Specific Sort Orders

SQL-specific sort orders carried over from SQL Server 2000 are also included in SQL Server 2005; they are shown in the following table. Many of these support some of the suffixes described previously, but not all suffix combinations are available.

SQL_1xCompat_CP850 SQL_Estonian_CP1257 SQL_Latin1_General_Pref_CP437
SQL_AltDiction_CP1253 SQL_Hungarian_CP1250 SQL_Latin1_General_Pref_CP850
SQL_AltDiction_CP850 SQL_Icelandic_Pref_CP1 SQL_Latvian_CP1257
SQL_AltDiction_Pref_CP850 SQL_Latin1_General_CP1 SQL_Lithuanian_CP1257
SQL_Croatian_CP1250 SQL_Latin1_General_CP1250 SQL_MixDiction_CP1253
SQL_Czech_CP1250 SQL_Latin1_General_CP1251 SQL_Polish_CP1250
SQL_Danish_Pref_CP1 SQL_Latin1_General_CP1253 SQL_Romanian_CP1250
SQL_EBCDIC037_CP1 SQL_Latin1_General_CP1254 SQL_Scandinavian_CP850
SQL_EBCDIC273_CP1 SQL_Latin1_General_CP1255 SQL_Scandinavian_Pref_CP850
SQL_EBCDIC277_CP1 SQL_Latin1_General_CP1256 SQL_Slovak_CP1250
SQL_EBCDIC278_CP1 SQL_Latin1_General_CP1257 SQL_Slovenian_CP1250
SQL_EBCDIC280_CP1 SQL_Latin1_General_CP437 SQL_SwedishPhone_Pref_CP1
SQL_EBCDIC284_CP1 SQL_Latin1_General_CP850 SQL_SwedishStd_Pref_CP1
SQL_EBCDIC285_CP1 SQL_Latin1_General_Pref_CP1 SQL_Ukrainian_CP1251

Appendix D: Languages Supported in SQL Server 2005

You can view this same list in SQL Server 2005 by executing this statement:

select langid, alias, lcid, msglangid from sys.syslanguages
Name in English Windows LCID Message ID
English 1033 1033
German 1031 1031
French 1036 1036
Japanese 1041 1041
Danish 1030 1030
Spanish 3082 3082
Italian 1040 1040
Dutch 1043 1043
Norwegian 2068 2068
Portuguese 2070 2070
Finnish 1035 1035
Swedish 1053 1053
Czech 1029 1029
Hungarian 1038 1038
Polish 1045 1045
Romanian 1048 1048
Croatian 1050 1050
Slovak 1051 1051
Slovene 1060 1060
Greek 1032 1032
Bulgarian 1026 1026
Russian 1049 1049
Turkish 1055 1055
British English 2057 1033
Estonian 1061 1061
Latvian 1062 1062
Lithuanian 1063 1063
Brazilian 1046 1046
Traditional Chinese 1028 1028
Korean 1042 1042
Simplified Chinese 2052 2052
Arabic 1025 1025
Thai 1054 1054

Appendix E: Full-Text Languages Added in SQL Server 2005

This list is also available in International Considerations for Full-Text Search in SQL Server 2005 Books Online. For the complete list of languages available for full-text indexing and querying operations, use sys.fulltext_languages.
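For example, the following query lists each registered full-text language with its LCID:

select lcid, name from sys.fulltext_languages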

Unicode collation locale identifier   Language for full-text data storage
Chinese Bopomofo (Taiwan)             Traditional Chinese
Chinese Punctuation                   Simplified Chinese
Chinese Stroke Count                  Simplified Chinese
Chinese Stroke Count (Taiwan)         Traditional Chinese
Dutch                                 Dutch
English U.K.                          English U.K.
French                                French
General Unicode                       English U.S.
German                                German
German Phonebook                      German
Italian                               Italian
Japanese                              Japanese
Japanese Unicode                      Japanese
Korean                                Korean
Korean Unicode                        Korean
Spanish (Spain)                       Spanish
Swedish/Finnish                       Swedish

Appendix F: Collations Updated in SQL Server 2005

This list is also available in Collation Settings in Setup in SQL Server 2005 Books Online. For the complete list of collations supported in an instance of SQL Server 2005, use fn_helpcollations().
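For example:

select name, description from fn_helpcollations()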

Old collation name                   New collation name
Japanese                             Japanese_90
Chinese_PRC                          Chinese_PRC_90
Chinese_PRC_Stroke                   Chinese_PRC_Stroke_90
Chinese_Taiwan_Bopomofo              Chinese_Taiwan_Bopomofo_90
Chinese_Taiwan_Stroke                Chinese_Taiwan_Stroke_90
Korean                               Korean_90
Hindi (deprecated in this release)   Indic_General_90