Invalid characters and escape rules
Applies to: SQL Server Azure SQL Database Azure SQL Managed Instance
This article describes how invalid XML characters are handled by the FOR XML clause, and lists the escape rules for characters that are invalid in XML names.
FOR XML and invalid characters
SQL Server entitizes invalid XML characters when they're returned within FOR XML queries that don't use the TYPE directive.
Although XML 1.0 conformant parsers raise parse errors regardless of whether these characters are entitized or not, the entitized form is better aligned with XML 1.1. The entitized form is also potentially better aligned with future versions of the XML standard. Additionally, it makes debugging simpler, because the code point of the invalid character becomes visible.
For users of XML tools, no workaround is required, because the XML parser will fail either way at the point where the invalid characters occur in the data stream. If you use non-XML tools, this change can require you to update your programming logic to search for these characters as entitized values.
The following white space characters are entitized differently in FOR XML queries to preserve their presence through round-tripping:
In element content and attributes:
hex(0D)
(carriage return)In attribute content:
hex(09)
(tab),hex(0A)
(line feed)
These characters are preserved in output, and a parser won't normalize them.
Escape rules
SQL Server names that contain characters that are invalid in XML names, such as spaces, are translated into XML names in a way in which the invalid characters are translated into escaped numeric entity encoding.
There are only two non-alphabetic characters that can occur within an XML name: the colon (:
) and the underscore (_
). Because the colon is already reserved for namespaces, the underscore is chosen as the escape character. Following are the escape rules that are used for encoding:
Any UCS-2 character that isn't a valid XML name character, according to the XML 1.0 specification, is escaped as
_xHHHH_
. TheHHHH
stands for the four-digit hexadecimal UCS-2 code for the character in the most significant bit-first order. For example, the table nameOrder Details
is encoded asOrder_x0020_Details
.Characters that don't fit into the UCS-2 realm (the UCS-4 additions of the range U+00010000 to U+0010FFFF) are encoded as
_xHHHHHHHH_
. TheHHHHHHHH
stands for the eight-digit hexadecimal UCS-4 encoding of the character, if under SQL Server 2000 backward compatibility mode. Otherwise, the characters are encoded as_xHHHHHH_
, in order to align with the ISO standard.The underscore character doesn't have to be escaped unless it's followed by the character
x
. For example, the table nameOrder_Details
is not encoded.The colon in identifiers isn't escaped. As a result, the namespace element and attribute names can be generated by the FOR XML query. For example, the following query generates a namespace attribute that has a colon in the name:
SELECT 'namespace-urn' as 'xmlns:namespace', 1 as 'namespace:a' FOR XML RAW;
The query produces this result:
<row xmlns:namespace="namespace-urn" namespace:a="1"/>
WITH XMLNAMESPACES is the recommended way to add XML namespaces.
See also
Feedback
https://aka.ms/ContentUserFeedback.
Coming soon: Throughout 2024 we will be phasing out GitHub Issues as the feedback mechanism for content and replacing it with a new feedback system. For more information see:Submit and view feedback for