[This documentation is for preview only, and is subject to change in later releases. Blank topics are included as placeholders.]
Represents a character as a UTF-16 code unit.
Assembly: mscorlib (in mscorlib.dll)
The Char type exposes the following members.
|Equals||Indicates whether this instance and a specified object are equal. (Inherited from ValueType.)|
|GetHashCode||Serves as a hash function for a particular type. (Inherited from Object.)|
|GetType||Gets the Type of the current instance. (Inherited from Object.)|
|ToLower||Returns the lower case character.|
|ToString||Converts the value of this instance to its equivalent string representation. (Overrides Object.ToString().)|
|ToUpper||Returns the upper case character.|
The Char structure represents a Unicode character. The Unicode Standard identifies each Unicode character with a unique 21-bit scalar number called a code point, and defines the UTF-16 encoding form that specifies how a code point is encoded into a sequence of one or more 16-bit values. Each 16-bit value ranges from hexadecimal 0x0000 through 0xFFFF and is stored in a Char structure. The value of a Char object is its 16-bit numeric (ordinal) value.
Char Objects, Unicode Characters, and Strings
A String object is a sequential collection of Char structures that represents a string of text. Most Unicode characters can be represented by a single Char object, but a character that is encoded as a base character, surrogate pair, and/or combining character sequence is represented by multiple Char objects. For this reason, a Char structure in a String object is not necessarily equivalent to a single Unicode character.
Multiple 16-bit code units are used to represent single Unicode characters in the following cases:
Glyphs, which may consist of a single character or of a base character followed by one or more combining characters. For example, the character ä is represented by a Char object whose code unit is U+0061 followed by a Char object whose code unit is U+0308. (The character ä can also be defined by a single Char object that has a code unit of U+00E4.) In the decomposed form, the character ä therefore consists of two Char objects.
Characters outside the Unicode Basic Multilingual Plane (BMP). Unicode supports sixteen planes in addition to the BMP, which represents plane 0. A Unicode code point is represented in UTF-32 by a 21-bit value that includes the plane. For example, U+1D160 represents the MUSICAL SYMBOL EIGHTH NOTE character. Because a UTF-16 code unit has only 16 bits, characters outside the BMP are represented by surrogate pairs in UTF-16. The UTF-16 equivalent of U+1D160, the MUSICAL SYMBOL EIGHTH NOTE character, is U+D834 U+DD60. U+D834 is the high surrogate; high surrogates range from U+D800 through U+DBFF. U+DD60 is the low surrogate; low surrogates range from U+DC00 through U+DFFF.
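The two cases above can be checked with a short C# sketch. It assumes the full .NET base class library (in particular Char.ConvertFromUtf32, Char.IsHighSurrogate, and Char.IsLowSurrogate, which are not in the member list on this page); CodeUnitDemo and ToCodeUnits are hypothetical names introduced only for illustration.

```csharp
using System;

public static class CodeUnitDemo
{
    // Hypothetical helper: formats each Char (UTF-16 code unit) of a string as U+XXXX.
    public static string ToCodeUnits(string s)
    {
        string[] parts = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
        {
            parts[i] = "U+" + ((int)s[i]).ToString("X4");
        }
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        // Base character a (U+0061) followed by COMBINING DIAERESIS (U+0308).
        string decomposed = "a\u0308";
        Console.WriteLine(decomposed.Length);               // 2
        Console.WriteLine(ToCodeUnits(decomposed));         // U+0061 U+0308

        // U+1D160 lies outside the BMP, so UTF-16 encodes it as a surrogate pair.
        string note = Char.ConvertFromUtf32(0x1D160);
        Console.WriteLine(note.Length);                     // 2
        Console.WriteLine(ToCodeUnits(note));               // U+D834 U+DD60
        Console.WriteLine(Char.IsHighSurrogate(note[0]));   // True
        Console.WriteLine(Char.IsLowSurrogate(note[1]));    // True
    }
}
```

In both cases String.Length reports 2, because it counts Char objects rather than characters.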
Characters and Text Elements
Because a single character can be represented by multiple Char objects, it is not always meaningful to work with individual Char objects. For instance, converting the Unicode code points that represent the Aegean numbers zero through nine to UTF-16 encoded code units produces a string of 20 Char objects. Code that erroneously equates Char objects with characters inaccurately reports that the resulting string has 20 characters.
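A minimal sketch of that miscount follows. It assumes the ten code points are the consecutive values U+10107 through U+10110 in the Aegean Numbers block (the page does not list them) and that Char.ConvertFromUtf32 is available; AegeanLengthDemo and BuildAegeanString are hypothetical names.

```csharp
using System;
using System.Text;

public static class AegeanLengthDemo
{
    // Hypothetical helper: builds a string from ten consecutive code points
    // starting at U+10107 (assumed range, Aegean Numbers block).
    public static string BuildAegeanString()
    {
        StringBuilder sb = new StringBuilder();
        for (int codePoint = 0x10107; codePoint <= 0x10110; codePoint++)
        {
            // Each code point is outside the BMP, so it becomes a surrogate pair.
            sb.Append(Char.ConvertFromUtf32(codePoint));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string aegean = BuildAegeanString();
        // Length counts Char objects (UTF-16 code units), not characters.
        Console.WriteLine("Length: {0}", aegean.Length);   // Length: 20
    }
}
```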
You can do the following to avoid the assumption that a Char object represents a single character.
You can work with a String object in its entirety instead of working with its individual characters to represent and analyze linguistic content.
You can use the StringInfo class to work with text elements instead of individual Char objects. For example, using a StringInfo object to count the number of text elements in a string that consists of the Aegean numbers zero through nine correctly reports that the string contains ten characters, because it considers a surrogate pair a single character.
If a string contains a base character that has one or more combining characters, you can call the String.Normalize method to convert the substring to a single UTF-16 encoded code unit. For example, calling the String.Normalize method converts the base character U+0061 (LATIN SMALL LETTER A) and combining character U+0308 (COMBINING DIAERESIS) to U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS).
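The last two options can be sketched as follows. This assumes the full .NET base class library with globalization data available (System.Globalization.StringInfo and String.Normalize with NormalizationForm.FormC), and again assumes the code points U+10107 through U+10110; TextElementDemo, CountTextElements, and Compose are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TextElementDemo
{
    // Counts text elements, so a surrogate pair counts once.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    // Composes base + combining sequences into precomposed characters (form C).
    public static string Compose(string s)
    {
        return s.Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        StringBuilder sb = new StringBuilder();
        for (int codePoint = 0x10107; codePoint <= 0x10110; codePoint++)
        {
            sb.Append(Char.ConvertFromUtf32(codePoint));
        }
        string aegean = sb.ToString();

        Console.WriteLine(aegean.Length);                       // 20 Char objects
        Console.WriteLine(CountTextElements(aegean));           // 10 text elements

        string composed = Compose("a\u0308");
        Console.WriteLine(composed.Length);                     // 1
        Console.WriteLine(((int)composed[0]).ToString("X4"));   // 00E4
    }
}
```

Normalization form C is chosen here because it composes characters where a precomposed form exists; other forms decompose instead and would leave the sequence as two Char objects.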
All members of this type are thread safe. Members that appear to modify instance state actually return a new instance initialized with the new value. As with any other type, reading and writing to a shared variable that contains an instance of this type must be protected by a lock to guarantee thread safety.