Preserving Client-Server Data Integrity with Unicode and Microsoft SQL Server 2000
Microsoft® SQL Server™ 2000
Summary: Learn about preserving the integrity of character data as it moves between client applications and SQL Server 2000. The particular focus is such data transmission in a global setting that may include a mix of Unicode and non-Unicode code pages. Topics include an explanation of Unicode and its impact on storage and performance, client-server data flow programming issues, and working with Data Transformation Services, bcp, BULK INSERT, and the XML functionality in SQL Server 2000. (17 printed pages)
What is Unicode?
Server-Side Programming Basics with Unicode
Managing Client-Server Data Flow Conversion in a Global Setting
Using Data Transformation Services
Using the bcp Utility and the BULK INSERT Command
Working with XML Data
Microsoft® SQL Server™ 2000 provides Unicode support in the storage and exchange of character data. With Unicode, you can store and process multilingual data without loss; if you are not using the Unicode standard, or if your environment includes a mix of Unicode and non-Unicode data, your multilingual data runs a greater risk of corruption (for reasons discussed later in this paper). Unicode therefore provides a benefit that usually outweighs any costs associated with storage or performance. This paper examines how to take advantage of Unicode in different client-server configurations that vary in their Unicode awareness.
Unicode is a standard for mapping code points to characters. It is designed to cover all the characters of all the languages of the world. As a result, there is no need for different code pages to handle different sets of characters.
SQL Server 2000 supports Unicode in a variety of ways, including:
- Providing the nchar, nvarchar, and ntext data types for storing character data and declaring parameters of stored procedures as Unicode.
- Recognizing Unicode when importing, exporting, copying, and converting data through the Data Transformation Services (DTS) Tools.
- Specifying Unicode format when copying data through the bcp utility.
- Recognizing XML data natively as Unicode.
Why Use Unicode?
The primary advantage of storing your character data in Unicode is that you do not have to worry about character corruption when data enters your database from the client side. For more information about character corruption, see the section How Do I Store Non-Unicode Data? later in this paper.
Also, using Unicode is the only way you can store data from multiple unrelated languages in a single column. With Unicode, your database essentially is ready for whatever character data comes its way.
How is Unicode Encoded?
Unicode does not specify how character data is encoded in binary format for storage in memory—it simply matches characters to code points, so that an application knows which character it is storing or processing. To store Unicode data in memory, there are three primary encoding schemes in use today: UCS-2, UTF-16, and UTF-8.
UCS-2 is the encoding scheme used by SQL Server 2000 to store Unicode data. Under this mechanism, all Unicode characters are stored using two bytes. UCS-2 is also the primary encoding mechanism used by the Microsoft Windows NT® 4.0 operating system.
UTF-16 stores most characters using two bytes, and the remainder using four. UTF-16 is the primary encoding mechanism used by Microsoft Windows® 2000.
The UCS-2 encoding scheme is actually a subset of the UTF-16 scheme: every UCS-2 encoded character is identical to the encoding of that character in UTF-16. The primary difference between UCS-2 and UTF-16 is in the recognition of surrogate characters. While the UCS-2 scheme can recognize 65,536 different characters, some languages, such as Chinese, need to define additional characters that, although rarely used, are found in classical literature and have historical significance. These characters, known as surrogates, are stored straightforwardly in four bytes under the UTF-16 encoding scheme, which can recognize an additional 1,048,576 characters. Because SQL Server uses UCS-2 storage, it treats a surrogate character as two undefined Unicode characters that, when paired together, define the surrogate character in storage. This way, SQL Server stores surrogate characters without risk of loss or corruption. SQL Server 2000 tools such as Query Analyzer, however, do not support the display or manipulation of surrogate characters; these operations must be handled by the client-side application.
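The following script is a minimal sketch of how a surrogate character is handled as a pair of UCS-2 code units on the server (the two code points shown form the surrogate pair for U+20000; rendering the pair as a single character is the client application's job):

```sql
-- Compose a surrogate pair from its two UCS-2 code units.
DECLARE @surrogate nvarchar(2)
SET @surrogate = NCHAR(0xD840) + NCHAR(0xDC00)   -- pair encoding U+20000

-- SQL Server counts the pair as two characters, because its
-- string functions operate on fixed-width UCS-2 code units.
SELECT LEN(@surrogate) AS ucs2_length
```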
A third encoding scheme, UTF-8, stores Unicode data in variable lengths of between one and four bytes. Many database programs, such as those developed by Oracle and Sybase, use this encoding scheme. When UTF-8 data interacts with Microsoft Component Object Model (COM) application program interfaces (APIs), or Windows NT and Windows 2000 components, it is converted either to UTF-16 or UCS-2 format using simple algorithms and APIs that perform this conversion efficiently. UTF-8 can be slower than UCS-2 for performing sorts, comparisons, and other operations on strings, because characters do not have fixed widths.
One way that SQL Server 2000 supports the UTF-8 encoding scheme is in its storage of XML text data, in order to take advantage of the flexibility of UTF-8 in traversing across networks. For more information, see the section Working with XML Data later in this paper.
The following data types are available in SQL Server 2000 for defining Unicode character data in tables and as parameters of stored procedures:
- nchar(n): For character data with a fixed length of n characters, not to exceed 4,000. Storage size is two times n bytes. Use nchar when the data entries in a column are expected to be consistently close to the same size.
- nvarchar(n): For character data with lengths that vary, but cannot exceed n characters, up to 4,000. Storage size, in bytes, is two times the number of characters entered. Use nvarchar when the data entries in a column are expected to vary considerably in size.
- ntext: For variable-length data with a maximum length of 2^30 - 1 (1,073,741,823) characters. Storage size, in bytes, is two times the number of characters entered. Use ntext for data longer than 4,000 characters.
The non-Unicode equivalents of the nchar, nvarchar, and ntext data types in SQL Server 2000 are listed below. When Unicode data is inserted into one of these non-Unicode data type columns through a command string (otherwise known as a "language event"), SQL Server converts the data to the data type using the code page associated with the collation of the column. When a character cannot be represented on a code page, it is replaced by a question mark (?), indicating the data has been lost. Appearance of unexpected characters or question marks in your data indicates your data has been converted from Unicode to non-Unicode at some layer, and this conversion resulted in lost characters.
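The question-mark substitution can be seen with a short script. This sketch assumes a database whose default collation is based on code page 1252 (Latin1); NCHAR(1046) produces the Cyrillic capital letter Zhe, which has no mapping in that code page:

```sql
-- The varchar column stores data in the collation's code page;
-- the nvarchar column stores data as Unicode.
CREATE TABLE #demo (c varchar(10), nc nvarchar(10))

INSERT INTO #demo VALUES (NCHAR(1046), NCHAR(1046))

-- c returns '?', because the character cannot be represented
-- in code page 1252; nc returns the character intact.
SELECT c, nc FROM #demo

DROP TABLE #demo
```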
- char(n): For character data with a fixed length of n bytes, not to exceed 8,000. Storage size is n bytes.
- varchar(n): For character data with lengths that vary, but cannot exceed n characters, up to 8,000. Storage size is the actual length of the data entered.
- text: For variable-length data in the code page of the server and with a maximum length of 2^31 - 1 (2,147,483,647) characters. When the server code page uses double-byte characters, the maximum storage is still 2,147,483,647 bytes.
What Does Unicode Have to Do with Collations?
When a collation is defined on non-Unicode data, the collation dictates both the sorting rules for the data as well as the code page used to store the data. A collation defined on Unicode data dictates only the sorting rules, not the code page. Unicode is, after all, its own storage specification.
There are two types of collations to choose from in SQL Server 2000: Windows collations and SQL collations. Windows collations define sorting rules based on those for an associated Windows locale. To do this, SQL Server replicates the Windows 2000 Server sorting rules and applies them to a corresponding Windows collation. This way, string comparisons for a given Windows collation in SQL Server are compatible with the same operations for a version of Windows running the same collation. However, because versions of Windows later than Windows 2000 Server, such as Windows XP and Windows Server 2003, use different sorting tables, the Windows collations of SQL Servers installed on these operating systems may display different sorting behavior than that of the host OS.
SQL collations are not associated with Windows locales. They are provided for compatibility with sort orders in earlier versions of SQL Server.
Windows collations actually apply Unicode-based sorting rules to both Unicode and non-Unicode data alike. This means that SQL Server internally converts non-Unicode data to Unicode to perform comparison operations. Doing so provides consistency across data types in SQL Server, and also provides developers the ability to sort strings in their applications using the same rules that SQL Server uses (by calling the CompareStringW Win32 API). SQL collations, on the other hand, apply non-Unicode sorting rules to non-Unicode data, and Unicode sorting rules to Unicode data (using a corresponding Windows collation for the Unicode data). This discrepancy can result in different results for comparisons of the same characters. Therefore, if you have a mix of Unicode and non-Unicode columns in your database, they should all be defined with Windows collations so that the same sorting rules are used across Unicode and non-Unicode data.
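For example, the following table definition (the table and column names are hypothetical) applies one Windows collation to both a Unicode and a non-Unicode column, so that sorts and comparisons behave identically across the two data types:

```sql
CREATE TABLE CustomerNames
(
    -- Both columns use the same Windows collation, so string
    -- comparisons follow the same Unicode-based sorting rules.
    name_ansi    varchar(40)  COLLATE Latin1_General_CI_AS,
    name_unicode nvarchar(40) COLLATE Latin1_General_CI_AS
)
```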
It is also important to realize the impact of disabling auto-translation on the client side with respect to Windows collations. As mentioned later in this paper, developers often disable autotranslation under the impression that this action will facilitate client-server cross-code page interaction. Not only is this action unsupported, but to perform sorts and scans of non-Unicode data defined with a Windows collation, SQL Server converts the data to Unicode prior to performing the sort. Therefore, if autotranslation is disabled, the translated Unicode data may not be converted correctly back to its original non-Unicode code page when it is retrieved on the client side.
For additional information on collations, see the SQL Server Books Online topic, "Collations," as well as its subtopics.
What is the Storage Impact of Using Unicode?
The difference in storing character data between Unicode and a non-Unicode code page depends on whether a particular non-Unicode code page uses double-byte character sets. All non-East Asian languages, plus the Thai language, store non-Unicode characters in single bytes. Therefore, storing these languages as Unicode uses twice as much space as specifying a non-Unicode code page. On the other hand, the non-Unicode code pages of many other Asian languages specify character storage in double-byte character sets (DBCS). For these languages, therefore, there is almost no difference in storage between non-Unicode and Unicode.
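A quick way to observe this difference is the DATALENGTH() function, which returns the number of bytes used by an expression:

```sql
-- Single-byte code page data uses one byte per character;
-- UCS-2 Unicode data uses two.
SELECT DATALENGTH('abc')  AS char_bytes,   -- 3 bytes
       DATALENGTH(N'abc') AS nchar_bytes   -- 6 bytes
```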
Note When storing Asian DBCS data, the UCS-2 encoding method used by SQL Server 2000 tends to be more efficient than the UTF-8 method used by many other database programs. This is because UTF-8 uses three bytes to store most Asian language characters, while UCS-2 uses just two. On the other hand, for non-DBCS data, such as ASCII-based characters, UTF-8 usually uses only one byte per character, while UCS-2 uses two.
The following non-Unicode code pages specify character data storage in double-byte character sets:
- 932 (Japanese)
- 936 (Simplified Chinese)
- 949 (Korean)
- 950 (Traditional Chinese)
What is the Performance Impact of Using Unicode?
The impact of Unicode data on performance is complicated by a variety of factors, including:
- The difference between Unicode sorting rules and non-Unicode sorting rules
- The difference between sorting double-byte and single-byte characters
- Code conversion between client and server
As mentioned previously, SQL Server performs string comparisons of non-Unicode data defined with a Windows collation using Unicode sorting rules. These rules are much more complex than non-Unicode sorting rules, and are therefore more resource-intensive. So, although Unicode sorting rules are generally more expensive, there is usually little difference in performance between Unicode data and non-Unicode data defined with a Windows collation.
The only case when SQL Server uses non-Unicode sorting rules is on non-Unicode data defined with a SQL collation. Sorts and scans in this instance tend to be faster than when Unicode sorting rules apply. Unicode sorting rules apply to all Unicode data, using either a Windows or SQL collation.
Of secondary importance, sorting large amounts of Unicode data can be slower than non-Unicode data, simply because the data is stored in double bytes. On the other hand, sorting Asian characters in Unicode is faster than sorting Asian DBCS data in a specific code page, because DBCS data is actually a mixture of single- and double-byte widths, while Unicode characters are fixed-width.
Other performance issues are determined primarily by the issue of converting the encoding mechanism between SQL Server and the client. In general, the effects on performance of client-server code page conversion are usually negligible. Nevertheless, it is important to understand what is happening at this layer.
The ODBC API (version 3.6 or earlier) and the DB-Library API do not recognize Unicode. For clients using data access methods defined by these APIs, resources are used to implicitly convert Unicode data to the client code page. In addition, there is a risk of data corruption on the client side when the client code page does not recognize certain Unicode characters.
Later versions of ODBC (beginning with version 2.1 of Microsoft Data Access Components, which shipped with SQL Server version 7.0), as well as OLE DB and ADO, are Unicode aware and assume a UCS-2 encoding mechanism. Therefore, there are no conversion issues when working with strictly Unicode data from SQL Server if the application is Unicode enabled. If a client uses a Unicode-enabled API but the data storage in SQL Server is not Unicode, data can be retrieved without conversion issues; however, any inserts or updates risk corruption if a character's code point cannot be mapped to the SQL Server code page.
What are the Overall Recommendations for Using Unicode?
Deciding whether to store non-DBCS data as Unicode usually boils down to an awareness of storage impacts, as well as how much sorting, conversion, and possible data corruption may take place during client interactions with the data. Sorting and conversion may impact performance, depending on where they take place, although for most applications the impact is negligible. Databases with well-designed indexes are especially unlikely to be affected. Data corruption, however, will impact not only the integrity of your application and database, but your business in general. With this trade-off in mind, storing character data in a specific code page may make sense if all of the following are true:
- Conserving storage space is an issue due to hardware limitations.
- You are performing frequent sorts of large amounts of data, and testing indicates that a Unicode storage mechanism severely impacts performance.
- You are certain the code pages of all clients accessing this data match yours, and that this situation will not unexpectedly change.
In most other cases, the decision to store character data, even non-DBCS data, in Unicode should be based more on business needs than performance. In a global economy—encouraged by explosive growth in Internet traffic—it is becoming more important than ever to support client computers running different locales. In addition, it is becoming increasingly difficult to pick a single code page that supports all of the characters required by a worldwide audience.
Enabling your database for Unicode involves not only defining Unicode storage, but defining Unicode-aware client interactions as well. This is done on the server side by:
- Switching from non-Unicode data types to Unicode data types in table columns and in CONVERT() and CAST() operations.
- Substituting use of ASCII() and CHAR() functions with their Unicode equivalents, UNICODE() and NCHAR().
- Defining variables and parameters of stored procedures and triggers in Unicode.
- Prefixing Unicode character string constants with the letter N.
Using UNICODE(), NCHAR(), and Other Functions
The ASCII() function returns the non-Unicode character code of the character passed in, so use the counterpart UNICODE() function for Unicode strings where you would use the ASCII function on non-Unicode strings. The same is true of the CHAR function; NCHAR is its Unicode counterpart.
The SOUNDEX() function is defined around English phonetic rules, so is not meaningful on Unicode strings unless the string only contains the Latin characters A-Z/a-z.
ASCII, CHAR, and SOUNDEX can be passed Unicode parameters, but these arguments are implicitly converted to non-Unicode strings before processing, because these functions operate on non-Unicode strings by definition. The conversion can lose any Unicode characters that do not exist in the target code page.
In addition to the UNICODE() and NCHAR() functions, the following string manipulation functions support Unicode wherever possible: CHARINDEX(), LEFT(), LEN(), UPPER(), LOWER(), LTRIM(), RTRIM(), PATINDEX(), REPLACE(), QUOTENAME(), REPLICATE(), REVERSE(), STUFF(), and SUBSTRING(). These functions accept Unicode arguments, respect the two-byte character boundaries of Unicode strings, and use Unicode sorting rules for string comparisons when the input parameters are Unicode.
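A short sketch of the paired functions (NCHAR(936) is the Greek capital letter Psi, U+03A8, which the ASCII/CHAR pair cannot represent in most code pages):

```sql
-- ASCII()/CHAR() operate on the code page; UNICODE()/NCHAR()
-- operate on Unicode code points.
SELECT ASCII('A')    AS ascii_code,    -- 65
       UNICODE(N'A') AS unicode_code   -- 65

SELECT CHAR(65)   AS from_ascii,       -- 'A'
       NCHAR(936) AS from_unicode      -- Greek capital Psi
```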
Defining Parameters in Stored Procedures
Defining parameters with a Unicode data type ensures that client requests or input are implicitly converted to Unicode on the server, and not corrupted in the process. If the parameter is specified as an OUTPUT parameter, a Unicode type also minimizes the chance of corruption on its way back to the client.
In the following stored procedure, both variables are declared as Unicode data types:
CREATE PROCEDURE au_info
    @lastname nvarchar(40),
    @firstname nvarchar(40)
AS
SELECT au_lname, au_fname, title, pub_name
FROM authors a
    INNER JOIN titleauthor ta ON a.au_id = ta.au_id
    INNER JOIN titles t ON t.title_id = ta.title_id
    INNER JOIN publishers p ON t.pub_id = p.pub_id
WHERE au_fname = @firstname
    AND au_lname = @lastname
Using the N Prefix
Unicode string constants that appear in code executed on the server, such as in stored procedures and triggers, must be preceded by the capital letter N. Without the N prefix, the string is converted to the default code page of the database, which may not recognize certain characters.
For example, the stored procedure created in the example above can be executed on the server in the following manner:
EXECUTE au_info @lastname = N'', @firstname = N''
EXECUTE au_info @lastname = N'Dale', @firstname = N'Ann'
In the following example of a stored procedure created on the server, the string constant must be preceded with N, even though the column being referenced is already defined as Unicode:
CREATE PROCEDURE Chinese_Authors
AS
SELECT au_lname
FROM Authors
WHERE Language = N''
This requirement holds both for string constants that originate on the server, as well as those sent from the client, as explained below.
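The effect of the prefix can be demonstrated with a character that falls outside the database's default code page. This sketch assumes a database collation based on code page 1252, which has no mapping for the Greek capital letter omega:

```sql
-- Without the N prefix, the literal is converted to the database's
-- default code page before the query runs.
SELECT 'Ω'  AS no_prefix,    -- returns '?': the character was lost
       N'Ω' AS with_prefix   -- returns the character intact
```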
Understanding Unicode and its use in the database is important, but is only half of the equation. The other half involves how client applications interact with the server under a variety of scenarios that depend on the degree to which the client and server are Unicode-aware.
Here are four common scenarios:
- A client-side application is programmed to interact with data stored using a specific code page. This application must be replicated for use in another office that will interact with the same data set, but must process the data in another code page. How can this new office, using one code page, use data in another code page?
- The storage of character data in SQL Server is in Unicode, but client-side applications are still ANSI code page-based. How can these ANSI-based applications talk to a Unicode server?
- Client-side applications recognize data in Unicode format, but the server side data store is still ANSI-based. How can these applications interact with the server?
- Both client and server are Unicode-aware, but applications read Unicode data in the UTF-8 or UTF-16 encoding scheme, while SQL Server stores Unicode data in UCS-2 format. How can data be converted between these two encoding schemes in a way that preserves the data flow?
Managing Data Conversion Between Client/Server Code Pages
In this situation, the server-side data storage is non-Unicode, and the client-side application is also not Unicode aware. If the server's data storage code page and the client-side application's code page are the same, then there is no problem. But if these code pages differ, the conversion that takes place between client and server may result in the loss of some characters.
Disabling the AutoTranslate feature of the SQL Server ODBC driver to insert data defined with a different code page from the server is not supported. Also, even if AutoTranslate is disabled, it does not prevent code page translation for SQL language events. The result is that if the client and database code pages do not match, code page translation will generally be applied to any non-Unicode character string sent to or from the server. (For more information about AutoTranslate, see Autotranslation of Character Data and PRB: SQL Server ODBC Driver Converts Language Events to Unicode.)
If possible, you should avoid this situation. The best choice for a code page-specific server is to communicate only with clients using the same code page. The second-best choice is to use another code page that has an almost identical character set. For example, code page 1252 (Latin1) and code page 850 (Multilingual Latin1) can store an almost identical set of characters, so most characters in these two code pages can be converted from one code page to another without data loss.
If you must communicate with clients using different code pages, the supported solution is to store your data in Unicode columns. If that is not feasible, the remaining alternative is to store the data in binary columns using the binary, varbinary, or image data types. Binary data, however, can only be sorted and compared in binary order, making it less flexible than character data.
Managing Data Conversion Between a Unicode Server and a Non-Unicode Client
In this situation, server-side data storage is in Unicode, but the client-side application uses a specific code page.
When non-Unicode data is sent from the client to be stored on the server in Unicode, data from any client with any code page can be stored correctly as long as one of the following conditions is true:
- Character strings are sent to the server as parameters of a remote procedure call (RPC).
- String constants are preceded with the capital letter N. This requirement is necessary regardless of whether or not your client-side application is Unicode aware. Without the N prefix, SQL Server will convert the string to the code page that corresponds to the default collation of the database. Any characters not found in this code page will be lost. For more information about this requirement, see INF: Unicode String Constants in SQL Server Require N Prefix.
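One way to satisfy the first condition from Transact-SQL is sp_executesql, which sends its arguments to the server as typed RPC parameters rather than as part of a command string. This sketch reuses the pubs sample database:

```sql
-- @name travels as a typed nvarchar parameter, so its characters
-- are never converted through a non-Unicode code page.
EXECUTE sp_executesql
    N'SELECT au_lname, au_fname FROM authors WHERE au_lname = @name',
    N'@name nvarchar(40)',
    @name = N'Dale'
```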
For example, a string constant containing characters outside the database's default code page will not be recognized if it is sent without the N prefix, even when a Unicode column is queried. Prefixing the same constant with N preserves it.
If the client application is not Unicode enabled and retrieves the data into non-Unicode buffers, any given client will only be able to retrieve or modify data that can be represented by the client machine's code page. This means that ASCII characters can always be retrieved, since the representation of ASCII characters is the same in all code pages, while any non-ASCII data depends on code-page-to-code-page conversion.
For example, suppose an application that is currently running only in the United States (U.S.) is deployed to Japan. The SQL Server database is Unicode-aware so that both English and Japanese text can be stored in the same tables, but the application has not yet been modified to deal with text as Unicode. As long as the application complies with one of the two options above, Japanese users can use the non-Unicode application to input and retrieve Japanese data, and U.S. users can input and retrieve English data. All data from both sets of users is stored intact in the same column of the database, and represented as Unicode. In this situation, a Unicode-enabled reporting application to generate reports spanning the entire data set, for example, can be deployed. However, English users cannot view the Japanese rows since the application is not capable of displaying any characters that do not exist in their code page (1252).
This situation may be acceptable if the two groups of users do not need to view the other group's records. If it is a requirement that a user of the application must be able to view or modify records with text that cannot be represented by a single code page, there is no choice but to modify the application to use Unicode.
If the client-side program is Web-based or connects to a Microsoft Active Server Pages (ASP) page, metadata must be specified on both the client-side HTML page and the server-side ASP page to control how character strings are converted between the server, the ASP engine, and the client browser.
On the client side HTML page, the META attribute must specify that the character set data should be converted to the client's encoding scheme by specifying a CHARSET code. For example, the following HTML page instructs the client to convert character data to the 950 (Chinese Traditional) code page by specifying "big5" as the CHARSET code (Charset codes for the META attribute can be found at CharSet Property).
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=big5">
<!-- -->
</HEAD>
<BODY>
<!-- body -->
</BODY>
</HTML>
On the server-side ASP page, you must instruct the ASP application what code page the client browser is using. You can specify the session.codepage property, or the @CodePage directive. These methods will handle conversion of data from server to client, as well as both GET and POST client requests. In the following examples, both methods are used to specify conversion to and from the client's code page, which is 950 (Chinese Traditional).
<%@ Language=VBScript codepage=950 %>
<% Session.CodePage=950 %>
And finally, you must still remember to prefix any string literals with the letter N.
Managing Data Conversion Between a Non-Unicode Server and a Unicode Client
In this situation, the client-side application is Unicode-aware, but is communicating with a server-side data storage that is not Unicode.
This situation is not much different from the previous situation. Here, the client is not constrained by only being able to handle characters that exist in the client's ANSI code page. The client can store or retrieve any characters that exist in the code page of the SQL Server database, even if this code page is different from that of the client. If it is a requirement that the database must store data from more than one code page, you must modify the database to use Unicode data types.
Managing Data Conversion Between Unicode Encoding Schemes
Here, both the server and client are Unicode-enabled, but as discussed previously, SQL Server stores Unicode in the UCS-2 encoding scheme, while many clients process Unicode in another encoding scheme, usually UTF-8. This scenario often occurs for Web-based applications.
In essence, you are still converting from one encoding scheme to another, so many of the same solutions discussed earlier in this paper also apply here. Unicode character string constants sent to the server must be preceded with a capital N. For Web-based applications, specify the CHARSET code under the META attribute of the client-side HTML page; for example, specify CHARSET=utf-8 if the client's Unicode encoding scheme is UTF-8. On the server side, specify the client's encoding scheme using the session.codepage property or the @codepage directive; for example, codepage=65001 specifies a UTF-8 encoding scheme. If you follow these measures, Microsoft Internet Information Services (IIS) 5.0 or later will seamlessly handle the conversion from UTF-8 to UCS-2 and back without any extra effort on your part.
Additional options for handling this conversion can be found at INF: Storing UTF-8 Data in SQL Server.
In Microsoft Visual Basic® applications, character strings are processed in the UCS-2 encoding scheme, and therefore there is no need to specify encoding scheme conversion explicitly between these applications and a SQL Server.
The advantages of working with Unicode are much the same when moving data through Data Transformation Services (DTS) as when moving data through client-server applications.
Importing and Exporting Data
When importing character data through DTS, defining Unicode on the server side ensures that imported data will be received intact and stored properly.
In addition, when either the source or the destination is not an instance of SQL Server, you must rely on the translation behavior of whatever ODBC driver or OLE DB provider is being used on the calling end. For example, if you want to import French OEM code page data into a 1252 code page SQL Server, and the source machine does not correctly perform conversion to the 1252 code page, data corruption may result.
Note When working with Unicode data, the DTS Import/Export Wizard contains certain dialog boxes that may not display Unicode data properly, but will not corrupt the data during transfer.
Copying Data Between Servers
When copying character data between instances of SQL Server, defining Unicode storage on both sides is the easiest way to ensure against data corruption, regardless of which direction you are moving the data. This precaution is especially valid when working with an instance of SQL Server 7.0, because non-Unicode data is limited to a single code page across the instance. If SQL Server 7.0 is the destination database, source database, or both, a "best-fit" mapping is attempted between code pages when non-Unicode data types are used.
In addition, DTS is an OLE DB application and makes use of Autotranslation to translate data from one code page to another. This means that all data retrieved is converted from the code page of the database or column to the ANSI code page of the machine running DTS. Then, when the data is inserted into the destination server, it is converted from the client's ANSI code page to the destination database's code page. For a character to survive translation intact, it is not enough for the character to exist in both the source and destination code pages; it must also exist in the ANSI code page of the machine where DTS is running. Therefore, if you are copying non-ANSI data (any code page that does not begin with "125"), it is best to turn auto-translation off on both the source and destination to prevent this "middle-tier" conversion to ANSI from occurring. Of course, doing so can result in data corruption and an unsupported configuration unless both the source and destination SQL Server databases use the exact same collation.
When copying data between two instances of SQL Server 2000 in which the data is not stored in Unicode on both sides, you should set the UseCollation property of the Copy SQL Server Objects Task to guard against data corruption. However, keep in mind that DTS relies only on the server-level collations to determine which code pages to convert to and from. If a server's collation is different from the database-level or column-level collation, data can be corrupted in unexpected ways.
Another option for guarding against data corruption when copying non-Unicode data is to use the SQL Server bcp utility in the following manner:
- Use bcp to export the data to a flat file. Specifying the -w or -N command-line flags (see below) converts the data to Unicode in the process.
- Use either bcp or DTS to import the Unicode data to the destination.
This method, however, will still lose any characters from the source data that do not exist in the destination code page.
Using the bcp utility is explained in more detail below.
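A rough Python sketch of this two-step flow (an illustration only; the code pages are assumed) shows why the Unicode intermediate file helps, but cannot rescue characters that are missing from the destination code page:

```python
# Step 1: bcp -w exports to a Unicode file; every character survives.
# Step 2: importing into a non-Unicode destination column converts to
# the destination code page, where an absent character is still lost.

greek = "Ω"                               # valid in cp1253 (Greek)
unicode_file = greek.encode("utf-16-le")  # -w: Unicode character format
restored = unicode_file.decode("utf-16-le")
print(restored == greek)                  # True: intermediate is lossless

dest_bytes = restored.encode("cp1252", errors="replace")
print(dest_bytes)                         # b'?': cp1252 has no Greek Omega
```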
For more information about DTS conversion issues, see Data Conversion and Transformation Considerations.
When using the bcp utility to copy data from SQL Server to another source, character data is, by default, converted to the OEM code page of the machine where bcp is running. When using bcp or the BULK INSERT statement to copy data from an outside source into SQL Server, character data is, by default, converted from the OEM code page of the client machine to the code page of the destination column. If Unicode is not specified on both sides, corruption may result. To minimize character corruption during the conversion process, you can override the default behavior by specifying one of the following flags using bcp:
| Flag | Format | Description |
|------|--------|-------------|
| -C xxx | Code page specifier | xxx specifies that the data is converted to ANSI, OEM, RAW (direct copy with no conversion), or a specific code page number. |
| -N | Unicode native format | Converts data to native database data types for all noncharacter data, and Unicode character format for all character data. |
| -w | Unicode character format | Converts data to Unicode character data format for all columns. |
For example, to bulk copy the authors2 table in the pubs database to the Authors.txt data file using code page 850, execute the following from the command prompt:
bcp pubs..authors2 out authors.txt -c -C850 -Sservername -Usa -Ppassword
When using BULK INSERT, you can specify that the data be converted to a specific code page with the CODEPAGE clause. For example, to bulk copy the Authors.txt data file into the authors2 table in the pubs database using code page 850, execute the following from a query tool such as SQL Query Analyzer:
BULK INSERT pubs..authors2 FROM 'c:\authors.txt' WITH ( CODEPAGE = '850' )
The following values can be used with both bcp and BULK INSERT to specify code page conversion:
| Code page value | Description |
|-----------------|-------------|
| OEM (the default) | Columns of char, varchar, or text data type are converted from the system OEM code page to the SQL Server code page when importing data to an instance of SQL Server, and vice versa when exporting data from an instance of SQL Server. |
| ACP | Columns of char, varchar, or text data type are converted from the ANSI/Windows code page (code page 1252) to the SQL Server code page when importing data to an instance of SQL Server, and vice versa when exporting data from an instance of SQL Server. |
| RAW | No conversion from one code page to another occurs; this is the fastest option. |
| <value> | A specific code page number (for example, 850). |
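The difference between the OEM and ACP interpretations can be seen in a small Python illustration (not bcp itself): a single byte value decodes to different characters under code page 437 (a common OEM code page) and code page 1252 (ANSI). This is why declaring the wrong code page substitutes characters silently rather than raising an error.

```python
# The same stored byte means different characters depending on which
# code page bcp or BULK INSERT is told the file uses.

raw = bytes([0x82])
print(raw.decode("cp437"))    # 'é' under OEM code page 437
print(raw.decode("cp1252"))   # '‚' under ANSI code page 1252
```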
For more information about using bcp and BULK INSERT, refer to Copying Data Between Different Collations.
Although in many cases SQL Server 2000 encodes the XML data it creates internally as UCS-2, it encodes XML data received from other sources as Unicode using the UTF-8 encoding scheme. As mentioned above, UTF-8 data is variable-width, so it is processed according to a byte-oriented protocol. This means that UTF-8 data can be treated in a way that is independent of the byte ordering of different computers (little endian versus big endian). It is therefore well suited for traversing different computers that use different encodings and byte-ordering systems. Because XML data is typically shared widely across networks, it makes sense to retain the default UTF-8 encoding of XML data in your database and when exporting XML data to clients.
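The byte-ordering point can be illustrated with a short Python sketch (an illustration, not SQL Server code): the same text yields different byte streams under little-endian and big-endian UTF-16, but a single canonical stream under UTF-8.

```python
# UTF-16 bytes depend on the machine's byte order; UTF-8 bytes do not.

text = "A\u3042"                          # 'A' plus Hiragana 'あ'
print(text.encode("utf-16-le").hex())     # 41004230
print(text.encode("utf-16-be").hex())     # 00413042 (byte-swapped)
print(text.encode("utf-8").hex())         # 41e38182 (order-independent)
```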
If you do need to specify a different encoding, you can do so in FOR XML requests by:
- Specifying the Output Encoding property of an XML-formatted data stream Response object in ASP.
For example, the following ASP code tells the browser to display an incoming XML data stream in UCS-2:
<% cmdXML.Properties("Output Encoding") = "UCS-2" %>
- Specifying an output encoding in a URL when making an HTTP request.
For example, the following request specifies UCS-2 as the output encoding of the returned XML document by using the outputencoding keyword (the server and virtual root names here are illustrative):
http://IISServer/nwind?sql=SELECT+FirstName,+LastName+FROM+Employees+FOR+XML+AUTO&outputencoding=UCS-2
- Specifying an output encoding in an XML template or style sheet.
This example specifies UCS-2 as the output encoding in the header of this XML template document:
<?xml version='1.0' encoding='UCS-2'?>
<root xmlns:sql='urn:schemas-microsoft-com:xml-sql' sql:xsl='MyXSL.xsl'>
  <sql:query>
    SELECT FirstName, LastName FROM Employees FOR XML AUTO
  </sql:query>
</root>
Note that if an encoding is specified directly in an XSL style sheet, it will override whatever encoding is specified in the template. Both, however, are overridden by the Output Encoding property specified on the ASP page.
When inserting data into SQL Server using OPENXML, you should specify Unicode data types anywhere in the rowset where multilingual data may appear; doing so minimizes character corruption.
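The effect of the rowset declaration can be sketched in Python (a hypothetical illustration; the element and attribute names are invented): values parsed from XML are always Unicode, so they survive only if the column they land in is a Unicode type.

```python
import xml.etree.ElementTree as ET

# Values parsed from XML are Unicode; a char/varchar-style column is
# simulated here by forcing the value through a single code page.
doc = ET.fromstring('<Employees><Employee Name="山田"/></Employees>')
name = doc.find("Employee").get("Name")

as_nvarchar = name                                            # intact
as_varchar = name.encode("cp1252", errors="replace").decode("cp1252")
print(as_nvarchar)   # 山田
print(as_varchar)    # ??
```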
Support for Unicode in SQL Server 2000 provides you with greater confidence that your character data will not be lost or distorted in storage or transmission in a global setting. The storage and performance costs of using Unicode are usually minimal compared to the benefits of handling multilingual data. When faced with client-server configurations that vary in their Unicode awareness, there are measures you can take to minimize the risk of data corruption. The optimal situation is a Unicode-aware server paired with a Unicode-aware client, while the least favorable is a non-Unicode client and a non-Unicode server with differing code pages.