Understanding Encodings

11/16/2012

Internally, the .NET Framework stores text as Unicode UTF-16. An encoder transforms this text data to a sequence of bytes. A decoder transforms a sequence of bytes into this internal format. An encoding describes the rules by which an encoder or decoder operates. For example, the UTF8Encoding class describes the rules for encoding to and decoding from a sequence of bytes representing text as Unicode UTF-8. Encoding and decoding can also include certain validation steps. For example, the UnicodeEncoding class checks all surrogates to make sure they constitute valid surrogate pairs. Both of these classes inherit from the Encoding class.
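The encode, decode, and validate steps described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name EncodeDecodeDemo and its helper methods are hypothetical, and the strict UnicodeEncoding instance assumes that passing throwOnInvalidBytes: true is how the application opts in to exceptions for invalid surrogates.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class EncodeDecodeDemo
{
    // Encoder direction: internal UTF-16 string -> UTF-8 bytes.
    public static byte[] EncodeUtf8(string text)
    {
        return new UTF8Encoding().GetBytes(text);
    }

    // Decoder direction: UTF-8 bytes -> internal UTF-16 string.
    public static string DecodeUtf8(byte[] bytes)
    {
        return new UTF8Encoding().GetString(bytes);
    }

    // Returns true if a UnicodeEncoding configured to throw on invalid data
    // rejects the text (for example, because of an unpaired surrogate).
    public static bool IsRejectedByStrictUtf16(string text)
    {
        var strict = new UnicodeEncoding(
            bigEndian: false, byteOrderMark: false, throwOnInvalidBytes: true);
        try
        {
            strict.GetBytes(text);
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }

    public static void Main()
    {
        byte[] bytes = EncodeUtf8("\u00E9");                 // LATIN SMALL LETTER E WITH ACUTE
        Console.WriteLine(BitConverter.ToString(bytes));     // C3-A9
        Console.WriteLine(DecodeUtf8(bytes) == "\u00E9");    // True
        Console.WriteLine(IsRejectedByStrictUtf16("a\uD800b")); // unpaired high surrogate
    }
}
```

The helpers create a new encoding object on each call only to keep the sketch short; an application would normally reuse a single instance.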

Choosing an Encoding

The Unicode Standard assigns a code point (a number) to each character in every supported script. A Unicode Transformation Format (UTF) is a way to encode that code point. For more information about the UTFs supported by the classes in System.Text, see Using Unicode Encoding in Unicode in the .NET Framework.

Selecting an Encoding Class

The Encoding class is very general. Supported classes inheriting from Encoding allow .NET applications to work with the common encodings they are likely to encounter in legacy applications, and you can implement additional encodings. However, when you can choose an encoding, use a Unicode encoding, typically either UTF8Encoding or UnicodeEncoding (UTF32Encoding is also supported). In particular, prefer UTF8Encoding over ASCIIEncoding. For ASCII content the two encodings produce identical bytes, but UTF8Encoding can also represent every Unicode character, whereas ASCIIEncoding supports only the Unicode character values between U+0000 and U+007F. Because ASCIIEncoding does not provide error detection, UTF8Encoding is also better for security.

UTF8Encoding has been tuned to be as fast as possible and should be faster than any other encoding. Even for content that is entirely ASCII, operations performed with UTF8Encoding are faster than operations performed with ASCIIEncoding. You should consider using ASCIIEncoding only for certain legacy applications. However, even in this case, UTF8Encoding might still be a better choice. Assuming default settings, the following scenarios can occur:

If the application has content that is not strictly ASCII and encodes it with ASCIIEncoding, each non-ASCII character is encoded as a question mark ("?"). If the application then decodes this data, the original characters cannot be recovered.
If the application has content that is not strictly ASCII and encodes it with UTF8Encoding, the result seems unintelligible if interpreted as ASCII. However, if the application then decodes this data with UTF8Encoding, the data completes the round trip successfully.
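The two scenarios above can be reproduced with a short sketch that uses default settings. The class name AsciiVsUtf8Demo and the RoundTrip helper are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class AsciiVsUtf8Demo
{
    // Encodes text to bytes and decodes it again with the same encoding.
    public static string RoundTrip(Encoding encoding, string text)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string text = "caf\u00E9"; // "café": one character outside the ASCII range

        // Scenario 1: ASCIIEncoding replaces the non-ASCII character with "?".
        Console.WriteLine(RoundTrip(Encoding.ASCII, text)); // caf?

        // Scenario 2: UTF8Encoding preserves the text across the round trip.
        Console.WriteLine(RoundTrip(Encoding.UTF8, text) == text); // True

        // UTF-8 bytes misread as ASCII do not reproduce the original text.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        Console.WriteLine(Encoding.ASCII.GetString(utf8Bytes) == text); // False
    }
}
```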

Choosing a Fallback Strategy

When an application attempts to encode or decode a character but no mapping exists, it must implement a fallback strategy, which is a failure-handling mechanism. There are two types of fallback strategies:

Best fit fallback

When characters do not have an exact match in the target encoding/decoding, the application can attempt to map them to a similar character.

Replacement string fallback

If there is no appropriate similar character, the application can specify a replacement string.

For example, an application can call GetEncoding(1252, 0, 0) (see GetEncoding). This call specifies Code Page 1252 (the Windows Code Page for Western European languages) with encoderFallback and decoderFallback specified as zero. The default behavior is a best fit mapping for certain Unicode characters. For example, CIRCLED LATIN CAPITAL LETTER S (U+24C8) is changed to LATIN CAPITAL LETTER S (U+0053) before it is encoded, while SUPERSCRIPT FIVE (U+2075) is changed to DIGIT FIVE (U+0035). If the application then decodes from Code Page 1252 back to Unicode, the circle around the letter is lost and 2⁵ becomes 25. Other conversions might be even more drastic. For instance, the Unicode INFINITY symbol (U+221E) might be mapped to DIGIT EIGHT (U+0038).

Best fit strategies vary for different code pages and they are not documented in detail. For example, for some code pages, full-width Latin characters map to the more common half-width Latin characters. For other code pages, this mapping is not made.

Even under an aggressive best fit strategy, there is no imaginable fit for some characters in some encodings. For example, a Chinese ideograph has no reasonable mapping to Code Page 1252. In this case, a replacement string is used. By default, this string is just a single QUESTION MARK (U+003F).
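The default replacement behavior can be observed with a short sketch. It makes one substitution that should be stated plainly: it uses ISO-8859-1 (code page 28591) instead of Code Page 1252, on the assumption that ISO-8859-1 is available without extra setup on more runtimes. The class name DefaultFallbackDemo and its helper are hypothetical. Because best fit tables are not documented in detail, the sketch prints the best fit result for the circled letter without asserting what it is.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class DefaultFallbackDemo
{
    // Encodes text with the encoding's default fallback and returns hex bytes.
    public static string EncodeToHex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 stands in for the code page named in the text.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Shows which fallback type the runtime installed by default.
        Console.WriteLine(latin1.EncoderFallback.GetType().Name);

        // CIRCLED LATIN CAPITAL LETTER S: the result depends on the best fit table.
        Console.WriteLine(EncodeToHex(latin1, "\u24C8"));

        // A Chinese ideograph has no similar character, so the default
        // replacement QUESTION MARK (0x3F) is written instead.
        Console.WriteLine(EncodeToHex(latin1, "\u4E2D"));
    }
}
```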

Best fit mapping is the default behavior for an Encoding object that encodes Unicode data into code page data, and there are legacy applications that rely on this behavior. However, most new applications should avoid best fit behavior for security reasons. For example, applications should not put a domain name through a best fit encoding, because distinct names could be mapped to the same byte sequence.

Your applications should use the following alternatives to best fit mapping:

Use only Unicode encodings (UTF8Encoding, UnicodeEncoding, and UTF32Encoding) to avoid fallback issues.

Warning: While UTF7Encoding is technically a Unicode encoding, it is less robust and secure than the other encodings. In some situations, changing one bit can radically alter the interpretation of an entire UTF-7 string. In other situations, substantially different UTF-7 strings can encode the same text. Consequently, UTF-7 should not be used when you have a choice. UTF-8 is preferred over UTF-7.
Use EncoderExceptionFallback and DecoderExceptionFallback, which throw an exception (EncoderFallbackException and DecoderFallbackException, respectively) if a character does not map exactly.
Use EncoderReplacementFallback and DecoderReplacementFallback to substitute a replacement string if a character does not map exactly. This is the default behavior for ASCIIEncoding. By default, this string is just a question mark, but methods are provided that allow an application to choose a different string. Although the replacement is typically a single character, this is not a requirement. For DecoderReplacementFallback, which is used when transforming text into Unicode, one character commonly used is REPLACEMENT CHARACTER (U+FFFD).
Use a customized EncoderFallback and/or DecoderFallback to implement a preferred strategy. See the Fallback Encoding Application Sample.
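The exception and replacement alternatives in the list above can be sketched as follows. This is an illustration under stated assumptions rather than a recommended configuration: the class name ExplicitFallbackDemo, its helper methods, the "us-ascii" target encoding, and the "(?)" replacement string are all chosen for the example.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class ExplicitFallbackDemo
{
    // ASCII encoding that throws instead of substituting characters.
    public static Encoding CreateStrictAscii()
    {
        return Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
    }

    // ASCII encoding with application-chosen replacement strings.
    public static Encoding CreateReplacingAscii()
    {
        return Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback("(?)"),
            new DecoderReplacementFallback("\uFFFD"));
    }

    public static void Main()
    {
        string text = "na\u00EFve"; // contains LATIN SMALL LETTER I WITH DIAERESIS

        try
        {
            CreateStrictAscii().GetBytes(text);
        }
        catch (EncoderFallbackException e)
        {
            // The exception reports which character failed and where.
            Console.WriteLine("Cannot encode U+{0:X4} at index {1}",
                (int)e.CharUnknown, e.Index);
        }

        Encoding replacing = CreateReplacingAscii();

        // Encoding: the unmappable character becomes the replacement string.
        Console.WriteLine(replacing.GetString(replacing.GetBytes(text))); // na(?)ve

        // Decoding: a byte outside the ASCII range becomes U+FFFD.
        string decoded = replacing.GetString(new byte[] { 0x61, 0xFF });
        Console.WriteLine(decoded == "a\uFFFD"); // True
    }
}
```

An exception fallback suits data that must not be silently altered, while a replacement fallback suits display scenarios where a visible marker is acceptable.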

Two further notes about best fit encoding (or decoding) fallback strategies:

Best fit is mostly an encoding issue, not a decoding issue. There are very few code pages that contain characters that cannot be mapped successfully to Unicode. Since these characters are not commonly used, they have been omitted from Unicode.
There are no supported named objects corresponding to the best fit fallbacks. The best fit fallback for each code page is distinct. If your application needs to switch back and forth between the best fit and some other fallback for a single Encoding object, it should copy the original best fit object to a variable before assigning any other fallback object. The application can then recover the best fit fallback by assigning that value back to Encoding.EncoderFallback or Encoding.DecoderFallback.
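Saving and restoring a fallback, as described in the second note, might look like the sketch below. It assumes that encodings returned by GetEncoding are read-only and therefore clones the object before assigning a fallback. ISO-8859-1 stands in for whichever code page the application uses, and the class name FallbackSwapDemo and its helper are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class FallbackSwapDemo
{
    // Returns a writable copy, because shared Encoding instances are read-only.
    public static Encoding CreateWritable(string name)
    {
        return (Encoding)Encoding.GetEncoding(name).Clone();
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 stands in for the application's code page.
        Encoding encoding = CreateWritable("iso-8859-1");

        // Keep a reference to the original (default) fallback object.
        EncoderFallback original = encoding.EncoderFallback;

        // Switch to a strict fallback for data that must map exactly.
        encoding.EncoderFallback = EncoderFallback.ExceptionFallback;
        try
        {
            encoding.GetBytes("\u4E2D"); // no mapping in ISO-8859-1
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("Strict fallback rejected the character.");
        }

        // Restore the original fallback by assigning the saved object back.
        encoding.EncoderFallback = original;
        Console.WriteLine(ReferenceEquals(encoding.EncoderFallback, original)); // True
    }
}
```

The same pattern applies to Encoding.DecoderFallback.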
