Surrogate Pair Characters in an XML Document

Article
11/16/2012

A surrogate pair is a pair of 16-bit Unicode encoding values that, together, represent a single character. The key point to remember is that a surrogate pair occupies 32 bits but is still one character, so you can no longer assume that one 16-bit Unicode encoding value maps to exactly one character.

Working with Surrogate Pairs

The first value of the surrogate pair is the high surrogate and contains a 16-bit code value in the range of U+D800 to U+DBFF. The second value of the pair, the low surrogate, contains a value in the range of U+DC00 to U+DFFF. By using surrogate pairs, a 16-bit Unicode encoded system can address more than one million additional characters (2^20, or 1,048,576, code points) covered by the Unicode standard.
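The split into a high and a low surrogate can be illustrated with a minimal sketch, written as C# top-level statements. The helper name ToSurrogatePair is hypothetical and not part of the .NET Framework; the framework method Char.ConvertFromUtf32 is used only as a cross-check.

```csharp
using System;

// Hypothetical helper: split a supplementary code point (U+10000 to U+10FFFF)
// into its two 16-bit values.
static (char High, char Low) ToSurrogatePair(int codePoint)
{
    if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        throw new ArgumentOutOfRangeException(nameof(codePoint));

    int offset = codePoint - 0x10000;            // 20-bit value
    char hi = (char)(0xD800 + (offset >> 10));   // upper 10 bits
    char lo = (char)(0xDC00 + (offset & 0x3FF)); // lower 10 bits
    return (hi, lo);
}

var (high, low) = ToSurrogatePair(0x1F600);
Console.WriteLine($"{(int)high:X4} {(int)low:X4}"); // D83D DE00

// The framework produces the same two values for this code point.
string s = char.ConvertFromUtf32(0x1F600);
Console.WriteLine(s.Length == 2 && s[0] == high && s[1] == low); // True
```

Each surrogate carries 10 bits of the offset, which is where the 2^20 additional code points come from.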

You can use surrogate characters in any string passed to an XmlTextWriter method. However, the surrogate character should be valid in the XML being written. For example, the World Wide Web Consortium (W3C) recommendation does not allow surrogate characters inside element or attribute names. If the string contains an invalid surrogate pair, an exception is thrown.
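As a rough illustration of the behavior described above, the following sketch passes a valid pair and a lone high surrogate to WriteString. The helper name TryWriteText is hypothetical, and the sketch assumes that the exception raised for an invalid pair is an ArgumentException, as the WriteString reference documentation states.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper: returns true if the writer accepted the text.
static bool TryWriteText(string text, out string xml)
{
    var sw = new StringWriter();
    var writer = new XmlTextWriter(sw);
    try
    {
        writer.WriteStartElement("root");
        writer.WriteString(text);   // throws if text holds an invalid surrogate pair
        writer.WriteEndElement();
        writer.Flush();
        xml = sw.ToString();
        return true;
    }
    catch (ArgumentException)
    {
        xml = "";
        return false;
    }
}

// A high surrogate followed by a low surrogate is accepted.
Console.WriteLine(TryWriteText("\uD83D\uDE00", out _));

// A high surrogate with no low surrogate after it is rejected.
Console.WriteLine(TryWriteText("\uD83D", out _));
```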

Additionally, you can use WriteSurrogateCharEntity to write the character entity corresponding to a surrogate pair. The character entity is written in hexadecimal format and is generated using the formula:

(highChar -0xD800) * 0x400 + (lowChar -0xDC00) + 0x10000

If the two characters do not form a valid surrogate pair, an exception is thrown. Note that the method takes the low surrogate first and the high surrogate second. The following example shows the WriteSurrogateCharEntity method with a surrogate pair as input.

// The following line writes &#x10000;
// (writer is assumed to be an XmlTextWriter that is ready to write text).
writer.WriteSurrogateCharEntity('\uDC00', '\uD800');
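To check the entity formula by hand, the following sketch evaluates it for the first and last possible surrogate pairs. The helper name ToCodePoint is hypothetical; its parameter order (low, then high) simply mirrors WriteSurrogateCharEntity, and Char.ConvertToUtf32 is used as a cross-check.

```csharp
using System;

// Hypothetical helper: apply the formula
// (highChar - 0xD800) * 0x400 + (lowChar - 0xDC00) + 0x10000.
static int ToCodePoint(char lowChar, char highChar) =>
    (highChar - 0xD800) * 0x400 + (lowChar - 0xDC00) + 0x10000;

int first = ToCodePoint('\uDC00', '\uD800');
int last = ToCodePoint('\uDFFF', '\uDBFF');

Console.WriteLine($"&#x{first:X};"); // &#x10000;
Console.WriteLine($"&#x{last:X};");  // &#x10FFFF;

// Char.ConvertToUtf32 takes the high surrogate first and gives the same value.
Console.WriteLine(first == char.ConvertToUtf32('\uD800', '\uDC00')); // True
```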

The following example writes a file that contains surrogate pairs, reads it with an XmlTextReader, and writes the attribute value out to a file with a new name. The original and new files are then loaded back into the application in the XML Document Object Model (DOM) structure for comparison.

using System;
using System.IO;
using System.Xml;

char lowChar, highChar;
char[] charArray = new char[10];
FileStream targetFile = new FileStream("SurrogatePair.xml",
      FileMode.Create, FileAccess.ReadWrite, FileShare.ReadWrite);

lowChar = Convert.ToChar(0xDC00);
highChar = Convert.ToChar(0xD800);
XmlTextWriter tw = new XmlTextWriter(targetFile, null);
tw.Formatting = Formatting.Indented;
tw.WriteStartElement("root");
tw.WriteStartAttribute("test", null);
tw.WriteSurrogateCharEntity(lowChar, highChar);
lowChar = Convert.ToChar(0xDC01);
highChar = Convert.ToChar(0xD801);
tw.WriteSurrogateCharEntity(lowChar, highChar);
lowChar = Convert.ToChar(0xDFFF);
highChar = Convert.ToChar(0xDBFF);
tw.WriteSurrogateCharEntity(lowChar, highChar);

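// Add 5 random surrogate pairs (10 chars) to the buffer.
// Each iteration stores a high surrogate followed by its low surrogate.
// "High" and "low" are meant in the logical sense, not as byte order in memory;
// for example, little-endian UTF-16 stores the word 6A21 as the bytes 21 6A.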
Random random = new Random();
for (int i = 0; i < 10; ++i) {
      lowChar = Convert.ToChar(random.Next(0xDC00, 0xE000));
      highChar = Convert.ToChar(random.Next(0xD800, 0xDC00));
      charArray[i] = highChar;
      charArray[++i] = lowChar;
}
tw.WriteChars(charArray, 0, charArray.Length);

for (int i = 0; i < 10; ++i) {
      lowChar = Convert.ToChar(random.Next(0xDC00, 0xE000));
      highChar = Convert.ToChar(random.Next(0xD800, 0xDC00));
      tw.WriteSurrogateCharEntity(lowChar, highChar);
}

tw.WriteEndAttribute();
tw.WriteEndElement();
tw.Flush();
tw.Close();

XmlTextReader r = new XmlTextReader("SurrogatePair.xml");

r.Read();
r.MoveToFirstAttribute();
targetFile = new FileStream("SurrogatePairFromReader.xml",
       FileMode.Create, FileAccess.ReadWrite, FileShare.ReadWrite);

tw = new XmlTextWriter(targetFile, null);
tw.Formatting = Formatting.Indented;
tw.WriteStartElement("root");
tw.WriteStartAttribute("test", null);
tw.WriteString(r.Value);
tw.WriteEndAttribute();
tw.WriteEndElement();
tw.Flush();
tw.Close();
r.Close();

// Load both result files into the DOM and compare.
XmlDocument doc1 = new XmlDocument();
XmlDocument doc2 = new XmlDocument();
doc1.Load("SurrogatePair.xml");
doc2.Load("SurrogatePairFromReader.xml");
if (doc1.InnerXml != doc2.InnerXml) {
      Console.WriteLine("Surrogate Pair test case failed");
}

When you write with the WriteChars method, which writes one buffer of data at a time, a surrogate pair in the input can accidentally be split across two buffers. Because the surrogate ranges are well defined, WriteChars recognizes any value from the high or the low surrogate range as one half of a surrogate pair. If a call to WriteChars would split a surrogate pair at the end of the buffer, an exception is thrown. To avoid this, use the Char.IsHighSurrogate method to check whether the buffer ends with a high surrogate character. If the last character in the buffer is not a high surrogate, you can pass the buffer to the WriteChars method. Otherwise, hold the high surrogate back and write it together with its low surrogate in the next buffer.
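One way to apply this check is sketched below. The helper name WriteCharsInChunks and the chunk size are assumptions made for illustration; only WriteChars and Char.IsHighSurrogate come from the framework.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper: write data in chunks of at most chunkSize chars,
// shortening a chunk by one char when it would end on a high surrogate.
static void WriteCharsInChunks(XmlWriter writer, char[] data, int chunkSize)
{
    if (chunkSize < 2) throw new ArgumentOutOfRangeException(nameof(chunkSize));

    int pos = 0;
    while (pos < data.Length)
    {
        int count = Math.Min(chunkSize, data.Length - pos);

        // More data follows and this chunk ends with a high surrogate:
        // leave it for the next chunk so the pair stays together.
        if (pos + count < data.Length && char.IsHighSurrogate(data[pos + count - 1]))
            count--;

        writer.WriteChars(data, pos, count);
        pos += count;
    }
}

var output = new StringWriter();
var xw = new XmlTextWriter(output);
xw.WriteStartElement("root");

// With a chunk size of 3, a fixed-size split would end the first chunk
// on the high surrogate \uD83D.
char[] chars = "ab\uD83D\uDE00cd".ToCharArray();
WriteCharsInChunks(xw, chars, 3);

xw.WriteEndElement();
xw.Flush();
Console.WriteLine(output.ToString());
```

The sketch requires a chunk size of at least 2 so that a chunk still holds at least one character after the high surrogate is held back. It does not repair invalid input: a buffer that ends with an unpaired high surrogate still causes WriteChars to throw.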
