Collation Terminology

Article
12/03/2008

To make the best use of language support in SQL Server 2005, you should understand the terms defined in this topic.

Terms

Code page
Collation
Data type
Globalization
Locale
Reading order
Sort order
Unicode

Code Page

A code page is an ordered set of characters of a given script in which a numeric index, or code point value, is associated with each character. A Microsoft Windows code page is commonly referred to as a character set or a charset. Code pages are used to provide support for character sets and keyboard layouts used by different Windows locales.

Related Topics:Setting Client Code Pages

Collation

A collation specifies the bit patterns that represent each character in a data set. Collations also determine the rules that sort and compare data. SQL Server 2005 supports storing objects with different collations in a single database—each column in a SQL Server database can have its own collation. For non-Unicode columns, the collation setting specifies the code page for the data and, as a result, which characters can be represented. Data can be moved between Unicode columns seamlessly. Data moved between non-Unicode columns cannot be moved seamlessly, and must be converted by the current code page.

The result of a Transact-SQL statement can vary when the statement is run in the context of different databases that each has a different collation setting. Best practices include use of a standardized collation for your organization, if possible. Using a standard collation setting across all systems in your organization will help to eliminate the need to explicitly specify the collation in every character or Unicode expression. If you must work with objects that have different collation and code page settings, you must code your queries to consider the rules of collation precedence. For more information, see Collation Precedence (Transact-SQL).

The characteristics of a collation are language sensitive, case sensitive, accent sensitive, Kana sensitive, and width sensitive.

SQL Server 2005 collations include the following groupings:

Windows collations
Windows collations define rules for storing character data based on an associated Windows locale. For a Windows collation, comparison of non-Unicode data is implemented using the same algorithm as Unicode data. The base Windows collation rules specify which alphabet or language is used when dictionary sorting is applied, as well as the code page used to store non-Unicode character data. Both Unicode and non-Unicode sorting are compatible with string comparisons in a particular version of Windows. This provides consistency across data types within SQL Server, and it also provides developers with the ability to sort strings in their applications using the same rules that are used by SQL Server; that is, by calling the CompareStringW function of the Microsoft Win32 API. For more information, see Collation Settings in Setup.

Binary collations
Binary collations sort data based on the sequence of coded values defined by the locale and data type. A binary collation in SQL Server defines the language locale and the ANSI code page to be used, enforcing a binary sort order. Binary collations are useful in achieving improved application performance due to their relative simplicity. For non-Unicode data types, data comparisons are based on the code points defined in the ANSI code page. For Unicode data types, data comparisons are based on the Unicode code points. For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin_1_General_BIN and Japanese_BIN yield identical sorting results when used on Unicode data.

Previous binary collations in SQL Server performed an incomplete code-point to code-point comparison for Unicode data, in that binary collations in previous versions of SQL Server compared the first character as WCHAR, followed by a byte-by-byte comparison. For backwards compatibility reasons, existing binary collation semantics will not be changed.

Binary collations in this release of SQL Server include a new set of pure code-point comparison collations. Customers can choose to migrate to the new binary collations to take advantage of true code-point comparisons, and they should utilize the new binary collations for development of new applications. The new BIN2 suffix identifies collation names that implement the new code-point collation semantics. In addition, a new comparison flag is added corresponding to BIN2 for the new binary sort. For more information, see Using Binary Collations.

SQL Server collations
SQL Server collations provide sort order compatibility with earlier versions of SQL Server. SQL Server collations are based on legacy SQL Server sort orders for non-Unicode data—for example, char and varchar data types—defined by SQL Server. The dictionary sorting rules for non-Unicode data are not compatible with any sorting routine provided by Windows operating systems, but sorting of Unicode data is compatible with a particular version of Windows sorting rules. Because SQL Server collations use different comparison rules for non-Unicode and Unicode data, you may see different results for comparisons of the same data, depending on the underlying data type. For more information, see Using SQL Collations.

Note

When you upgrade an instance of SQL Server, SQL Server collations can be specified for compatibility with existing instances of SQL Server. Because the default collation for an instance of SQL Server is defined during Setup, it is important to specify collation settings carefully when:
- Your application code depends in some way on the behavior of previous SQL Server collations.
- You are going to use SQL Server 2005 replication with existing installations of SQL Server 6.5 or SQL Server 7.0.
- You must store character data that reflects multiple languages.

SQL Server 2005 supports setting collations at the following levels of a SQL Server 2005 instance:

Server-level collations
The default collation of a SQL Server instance is set during Setup. The default collation of the instance also becomes the default collation of the system databases: master, model, tempdb, msdb, and distribution. After a collation has been assigned to any object other than a column or a database, you cannot change the collation except by dropping and re-creating the object. Instead of changing the default collation of a SQL Server instance, you can specify the collation at the time you create a new database or database column.

To query the server collation for a SQL Server instance, use the following Transact-SQL SERVERPROPERTY function:
```
SELECT CONVERT (varchar, SERVERPROPERTY('collation'))
```
To query the server for all available collations, use the following fn_helpcollations() built-in function:
```
SELECT * from ::fn_helpcollations()
```

Database-level collations
When a database is created, the COLLATE clause of the CREATE DATABASE statement can be used to specify the default collation of the database. If no collation is specified during database creation, the database is assigned the default collation of the model database. The default collation for the model database is the same as the default collation of the SQL Server instance.

The collation of a user database can be changed with an ALTER DATABASE statement like the following one:
```
ALTER DATABASE myDB COLLATE Greek_CS_AI
```
The current collation of a database can be retrieved by using a statement like the following one:
```
SELECT CONVERT (varchar, DATABASEPROPERTYEX('database_name','collation'))
```
Note

Altering the database-level collation does not affect user-, table-, or column-level collations.

Column-level collations
When creating a table, collations for each character-string column can be specified using the COLLATE clause of the CREATE TABLE statement. If no collation is specified during table creation, the column is assigned the default collation of the database.

The collation of a column can be changed with the ALTER TABLE statement like the following one:
```
ALTER TABLE myTable ALTER COLUMN mycol NVARCHAR(10) COLLATE Greek_CS_AI
```

Expression-level collations
Expression-level collations are set at the time a statement is run, and they affect the way a result set is returned. This allows sorting of a result so that the ORDER BY clause can be language-specific. Use a COLLATE clause like the following one to implement expression-level collations:
```
SELECT name FROM customer ORDER BY name COLLATE Latin1_General_CS_AI
```

Data Type

Data type is a definition that specifies a range of values, the operations that can be performed on the values, and the way that the values are stored in memory on your computer. Defining data types enables SQL Server to manipulate data in predictable ways. Non-Unicode character data types are char, varchar, and text. Unicode data types use the Unicode character representation; they are nchar, nvarchar, and ntext. Using Unicode data types in your applications is recommended, especially if you store character data that reflects multiple languages.

Globalization

Globalization is the process of developing a software application with features and code design that consider multiple spoken languages and locales. Globalized application design accommodates multiple locales and Unicode-supported languages for data input, processing, display, and output.

Related Topics:International Considerations for SQL Server

Locale

A locale is a set of information associated with a place or a culture—the name and identifier of the spoken language, the script used to write the language, and cultural conventions. SQL Server 2005 supports all 135 locales supported by Windows XP. Among them are five Chinese language locales (Hong Kong SAR, Macau SAR, the People's Republic of China, Singapore, and Taiwan); thirteen English language locales (Australia, Belize, Canada, Caribbean, Ireland, Jamaica, New Zealand, Philippines, South Africa, Trinidad, the United Kingdom, the United States of America, and Zimbabwe); and six French language locales (Belgium, Canada, France, Luxembourg, Monaco, and Switzerland).

The following table illustrates several differences between four common locales supported by Windows.

Locale	English (U.S.A.)	French (France)	Japanese	United Arab Emirates
Country/region	United States	France	Japan	United Arab Emirates
Language	English	French	Japanese	Arabic
Written scripts	Latin	Latin	Kana, kanji	Arabic
Reading order	Left to right	Left to right	Left to right	Right to left
Windows-defined code page	1252	1252	932	1256
Time format	1:00 pm	13:00	13:00	1:00 p
Calendar	Gregorian	Gregorian	Gregorian (Localized)	Gregorian (Localized)
Default paper size	U.S. Letter	A4	A4	A4
Decimal separator	.	,	.	,
List separator	,	;	,	;
Thousands separator	,	space	,	,

Reading Order

Reading order is the overall direction of an ordered sequence of text, relating to the word order, not the order of the entered characters. For example, using Arabic as the keyboard language means that new characters will always flow from right to left. When Latin is the keyboard language, new characters flow from left to right.

Sort Order

Sort order specifies the way that data values are sorted, affecting the results of data comparison. The sorting of data is accomplished through collations, and it can be optimized using indexes.

Related Topics:Windows Collation Sorting Styles, Indexes

Unicode

Unicode represents the characters of a language with two bytes rather than one, enabling a single Unicode character set to represent almost all of the written languages of the world. Unicode was developed, and is maintained and promoted by the Unicode Consortium, a nonprofit computer industry organization. For more information, see the Unicode Consortium Web site.

If you store character data that reflect multiple languages, always use Unicode data types (nchar, nvarchar, and ntext) instead of the non-Unicode data types (char, varchar, and text). You may experience a significant performance gain through use of Unicode because fewer code-page conversions will be required. There are significant limitations associated with non-Unicode data types, because a non-Unicode computer will be limited to use of a single code page. To completely evaluate issues related to use of Unicode or non-Unicode data types, you must test your scenario to measure performance differences in your particular environment. At a minimum, standardize the site collation, and deploy Unicode servers and clients wherever possible.

In most cases, your SQL Server instance will interact with other servers or clients, and your instance may use multiple data access standards. SQL Server clients are one of two main types:

Unicode clients that use OLE DB and Open Database Connectivity (ODBC) versions 3.7 or later.
Non-Unicode clients that use DB-Library and ODBC versions 3.6 or earlier.

The following table presents considerations for use of multilingual data with various combinations of Unicode and non-Unicode servers.

Server	Client	Benefits or Limitations
Unicode	Unicode	An ideal configuration because Unicode data will be used throughout the system. This scenario provides the best performance and protection from corruption of retrieved data. This is the case with Microsoft ActiveX Data Objects (ADO), OLE DB, and ODBC versions 3.7 or later.
Unicode	Non-Unicode	In this scenario, data storage may not be a problem, but there may be limitations when you move data to a client computer. At a minimum, Unicode data will have to be converted using the code page on the non-Unicode client.
Non-Unicode	Unicode	This is not an ideal configuration for use of multilingual data. You will not be able to write Unicode data to the non-Unicode server. Problems are likely to occur when data is sent to servers that are outside the server's code page.
Non-Unicode	Non-Unicode	This is a very limiting scenario for multilingual data. You will be limited to use of a single code page. The ideal configuration is a Unicode server with Unicode clients.

Related Topics:Unicode Basics