Character Classes 

A character class represents a set of characters that can match an input string. Combine literal characters, escape characters, and character classes to form a regular expression pattern.

Character classes define sets of characters. Some character classes are equivalent to one or more Unicode general category values or Unicode blocks. A Unicode general category defines the broad classification of a character; that is, whether the character is a type of letter, decimal digit, separator, mathematical symbol, punctuation, and so on. For example, the Lu general category represents "Letter, Uppercase" and the Sm category represents "Symbol, Math". For more information, see Supported Unicode General Categories.

A Unicode block is a named range of Unicode code points. The .NET Framework provides a set of named blocks derived from the Unicode block names. For example, the .NET Framework provides the IsBasicLatin named block, which corresponds to the Basic Latin Unicode block and contains characters ranging from U+0000 through U+007F. For more information, see Supported Named Blocks.

The .NET Framework supports character class subtraction expressions, which enables you to define a set of characters as the result of excluding one character class from another character class. For more information, see Character Class Subtraction.

Character Class Syntax

The following table summarizes the character classes and their syntax.

Character class Description

[character_group]

(Positive character group.) Matches any character in the specified character group.

The character group consists of one or more literal characters, escape characters, character ranges, or character classes that are concatenated.

For example, to specify all vowels, use [aeiou]. To specify all punctuation and decimal digit characters, code [\p{P}\d].

[^character_group]

(Negative character group.) Matches any character not in the specified character group.

The character group consists of one or more literal characters, escape characters, character ranges, or character classes that are concatenated. The leading carat character (^) is mandatory and indicates the character group is a negative character group instead of a positive character group.

For example, to specify all characters except vowels, use [^aeiou]. To specify all characters except punctuation and decimal digit characters, use [^\p{P}\d].

[firstCharacter-lastCharacter]

(Character range.) Matches any character in a range of characters.

A character range is a contiguous series of characters defined by specifying the first character in the series, a hyphen (-), and then the last character in the series. Two characters are contiguous if they have adjacent Unicode code points. Two or more character ranges can be concatenated.

For example, to specify the range of decimal digits from '0' through '9', the range of lowercase letters from 'a' through 'f', and the range of uppercase letters from 'A' through 'F', use [0-9a-fA-F].

.

(The period character.) Matches any character except \n. If modified by the Singleline option, a period character matches any character. For more information, see Regular Expression Options.

Note that a period character in a positive or negative character group (a period within square brackets) is treated as a literal period character, not a character class.

\p{name}

Matches any character in the Unicode general category or named block specified by name (for example, Ll, Nd, Z, IsGreek, and IsBoxDrawing).

\P{name}

Matches any character not in Unicode general category or named block specified in name.

\w

Matches any word character. Equivalent to the Unicode general categories [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}\p{Lm}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to [a-zA-Z_0-9].

\W

Matches any nonword character. Equivalent to the Unicode general categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}\p{Lm}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \W is equivalent to [^a-zA-Z_0-9].

\s

Matches any white-space character. Equivalent to the escape sequences and Unicode general categories [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v].

\S

Matches any non-white-space character. Equivalent to the escape sequences and Unicode general categories [^\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \S is equivalent to [^ \f\n\r\t\v].

\d

Matches any decimal digit. Equivalent to \p{Nd} for Unicode and [0-9] for non-Unicode, ECMAScript behavior.

\D

Matches any nondigit character. Equivalent to \P{Nd} for Unicode and [^0-9] for non-Unicode, ECMAScript behavior.

Supported Unicode General Categories

Unicode defines the general categories and descriptions listed in the following table. For more information, see the "UCD File Format" and "General Category Values" subtopics at the Unicode Character Database.

Category Description

Lu

Letter, Uppercase

Ll

Letter, Lowercase

Lt

Letter, Titlecase

Lm

Letter, Modifier

Lo

Letter, Other

Mn

Mark, Nonspacing

Mc

Mark, Spacing Combining

Me

Mark, Enclosing

Nd

Number, Decimal Digit

Nl

Number, Letter

No

Number, Other

Pc

Punctuation, Connector

Pd

Punctuation, Dash

Ps

Punctuation, Open

Pe

Punctuation, Close

Pi

Punctuation, Initial quote (may behave like Ps or Pe depending on usage)

Pf

Punctuation, Final quote (may behave like Ps or Pe depending on usage)

Po

Punctuation, Other

Sm

Symbol, Math

Sc

Symbol, Currency

Sk

Symbol, Modifier

So

Symbol, Other

Zs

Separator, Space

Zl

Separator, Line

Zp

Separator, Paragraph

Cc

Other, Control

Cf

Other, Format

Cs

Other, Surrogate

Co

Other, Private Use

Cn

Other, Not Assigned (no characters have this property)

The .NET Framework provides additional categories that represent a set of Unicode character categories, as shown in the following table.

Category Represents

C

(All control characters) Cc, Cf, Cs, Co, and Cn.

L

(All letters) Lu, Ll, Lt, Lm, and Lo.

M

(All diacritic marks) Mn, Mc, and Me.

N

(All numbers) Nd, Nl, and No.

P

(All punctuation) Pc, Pd, Ps, Pe, Pi, Pf, and Po.

S

(All symbols) Sm, Sc, Sk, and So.

Z

(All separators) Zs, Zl, and Zp.

Supported Named Blocks

The .NET Framework provides the named blocks listed in the following table. The set of supported named blocks is based on Unicode 4.0 and Perl 5.6.

Code point range Block name

0000 - 007F

IsBasicLatin

0080 - 00FF

IsLatin-1Supplement

0100 - 017F

IsLatinExtended-A

0180 - 024F

IsLatinExtended-B

0250 - 02AF

IsIPAExtensions

02B0 - 02FF

IsSpacingModifierLetters

0300 - 036F

IsCombiningDiacriticalMarks

0370 - 03FF

IsGreek

-or-

IsGreekandCoptic

0400 - 04FF

IsCyrillic

0500 - 052F

IsCyrillicSupplement

0530 - 058F

IsArmenian

0590 - 05FF

IsHebrew

0600 - 06FF

IsArabic

0700 - 074F

IsSyriac

0780 - 07BF

IsThaana

0900 - 097F

IsDevanagari

0980 - 09FF

IsBengali

0A00 - 0A7F

IsGurmukhi

0A80 - 0AFF

IsGujarati

0B00 - 0B7F

IsOriya

0B80 - 0BFF

IsTamil

0C00 - 0C7F

IsTelugu

0C80 - 0CFF

IsKannada

0D00 - 0D7F

IsMalayalam

0D80 - 0DFF

IsSinhala

0E00 - 0E7F

IsThai

0E80 - 0EFF

IsLao

0F00 - 0FFF

IsTibetan

1000 - 109F

IsMyanmar

10A0 - 10FF

IsGeorgian

1100 - 11FF

IsHangulJamo

1200 - 137F

IsEthiopic

13A0 - 13FF

IsCherokee

1400 - 167F

IsUnifiedCanadianAboriginalSyllabics

1680 - 169F

IsOgham

16A0 - 16FF

IsRunic

1700 - 171F

IsTagalog

1720 - 173F

IsHanunoo

1740 - 175F

IsBuhid

1760 - 177F

IsTagbanwa

1780 - 17FF

IsKhmer

1800 - 18AF

IsMongolian

1900 - 194F

IsLimbu

1950 - 197F

IsTaiLe

19E0 - 19FF

IsKhmerSymbols

1D00 - 1D7F

IsPhoneticExtensions

1E00 - 1EFF

IsLatinExtendedAdditional

1F00 - 1FFF

IsGreekExtended

2000 - 206F

IsGeneralPunctuation

2070 - 209F

IsSuperscriptsandSubscripts

20A0 - 20CF

IsCurrencySymbols

20D0 - 20FF

IsCombiningDiacriticalMarksforSymbols

-or-

IsCombiningMarksforSymbols

2100 - 214F

IsLetterlikeSymbols

2150 - 218F

IsNumberForms

2190 - 21FF

IsArrows

2200 - 22FF

IsMathematicalOperators

2300 - 23FF

IsMiscellaneousTechnical

2400 - 243F

IsControlPictures

2440 - 245F

IsOpticalCharacterRecognition

2460 - 24FF

IsEnclosedAlphanumerics

2500 - 257F

IsBoxDrawing

2580 - 259F

IsBlockElements

25A0 - 25FF

IsGeometricShapes

2600 - 26FF

IsMiscellaneousSymbols

2700 - 27BF

IsDingbats

27C0 - 27EF

IsMiscellaneousMathematicalSymbols-A

27F0 - 27FF

IsSupplementalArrows-A

2800 - 28FF

IsBraillePatterns

2900 - 297F

IsSupplementalArrows-B

2980 - 29FF

IsMiscellaneousMathematicalSymbols-B

2A00 - 2AFF

IsSupplementalMathematicalOperators

2B00 - 2BFF

IsMiscellaneousSymbolsandArrows

2E80 - 2EFF

IsCJKRadicalsSupplement

2F00 - 2FDF

IsKangxiRadicals

2FF0 - 2FFF

IsIdeographicDescriptionCharacters

3000 - 303F

IsCJKSymbolsandPunctuation

3040 - 309F

IsHiragana

30A0 - 30FF

IsKatakana

3100 - 312F

IsBopomofo

3130 - 318F

IsHangulCompatibilityJamo

3190 - 319F

IsKanbun

31A0 - 31BF

IsBopomofoExtended

31F0 - 31FF

IsKatakanaPhoneticExtensions

3200 - 32FF

IsEnclosedCJKLettersandMonths

3300 - 33FF

IsCJKCompatibility

3400 - 4DBF

IsCJKUnifiedIdeographsExtensionA

4DC0 - 4DFF

IsYijingHexagramSymbols

4E00 - 9FFF

IsCJKUnifiedIdeographs

A000 - A48F

IsYiSyllables

A490 - A4CF

IsYiRadicals

AC00 - D7AF

IsHangulSyllables

D800 - DB7F

IsHighSurrogates

DB80 - DBFF

IsHighPrivateUseSurrogates

DC00 - DFFF

IsLowSurrogates

E000 - F8FF

IsPrivateUse

F900 - FAFF

IsPrivateUseArea

FB00 - FB4F

IsCJKCompatibilityIdeographs

FB50 - FDFF

IsAlphabeticPresentationForms

FE00 - FE0F

IsArabicPresentationForms-A

FE20 - FE2F

IsVariationSelectors

FE30 - FE4F

IsCombiningHalfMarks

FE50 - FE6F

IsCJKCompatibilityForms

FE70 - FEFF

IsSmallFormVariants

FF00 - FFEF

IsArabicPresentationForms-B

FFF0 - FFFF

IsHalfwidthandFullwidthForms

Character Class Subtraction

A character class defines a set of characters. Character class subtraction yields a set of characters that is the result of excluding the characters in one character class from another character class.

A character class subtraction expression has the following form:

[base_group-[excluded_group]]

The square brackets ([]) and hyphen (-) are mandatory. The base_group is a positive or negative character group as described in the Character Class Syntax table. The excluded_group component is another positive or negative character group, or another character class subtraction expression (that is, you can nest character class subtraction expressions).

For example, suppose you have a base group that consists of the character range from 'a' through 'z'. To define the set of characters that consists of the base group except for the character 'm', use [a-z-[m]]. To define the set of characters that consists of the base group except for the set of characters 'd', 'j', and 'p', use [a-z-[djp]]. To define the set of characters that consists of the base group except for the character range from 'm' through 'p', use [a-z-[m-p]].

Consider the nested character class subtraction expression, [a-z-[d-w-[m-o]]]. The expression is evaluated from the innermost character range outward. First, the character range from 'm' through 'o' is subtracted from the character range 'd' through 'w', which yields the set of characters from 'd' through 'l' and 'p' through 'w'. That set is then subtracted from the character range from 'a' through 'z', which yields the set of characters, [abcmnoxyz].

You can use any character class with character class subtraction. To define the set of characters that consists of all Unicode characters from \u0000 through \uFFFF except white-space characters (\s), the characters in the punctuation general category (\p{P}), the characters in the IsGreek named block (\p{IsGreek}), and the Unicode NEXT LINE control character (\x85), use [\u0000-\uFFFF-[\s\p{P}\p{IsGreek}\x85]].

Choose character classes for a character class subtraction expression that will yield useful results. Avoid an expression that yields an empty set of characters, which cannot match anything, or an expression that is equivalent to the original base group. For example, the empty set is the result of the expression [\p{IsBasicLatin}-[\x00-\x7F]], which subtracts all characters from the IsBasicLatin general category. Similarly, the original base group is the result of the expression [a-z-[0-9]]. This is because the base group, which is the character range of letters from 'a' through 'z', does not contain any characters in the excluded group, which is the character range of decimal digits from '0' through '9'.

Note that XML Schema Regular Expressions has similar support for character class subtraction.

See Also

Reference

GetUnicodeCategory
Regular Expression Options

Other Resources

Regular Expression Language Elements