Windows PowerShell: Writing Regular Expressions

Don Jones

192.168.4.5. \\Server57\Share. johnd@contoso.com. You no doubt recognize these three items as an IP address, a Universal Naming Convention (UNC) path, and an e-mail address. Your brain recognizes their formats. Four groupings of digits, backslashes, the @ symbol, and other cues indicate what types of data these strings of characters represent. With little thought, you can quickly recognize that 192.168 on its own isn't a valid IP address, that 7\\Server2\\Share isn't a valid UNC, and that joe@contoso isn't a valid e-mail address.

Computers, unfortunately, have to work a bit harder in order to "understand" complicated formats like these. That's where regular expressions come into play. A regular expression is a string, written using a special regular expression language, that helps a computer identify strings that are of a particular format, such as an IP address, a UNC, or an e-mail address. A well-written regular expression lets a Windows PowerShell™ script accept data that conforms to the format you've specified and reject data that does not.

Making a Simple Match

The Windows PowerShell -match operator compares a string to a regular expression, or regex, and then returns either True or False depending on whether the string matches the regex. A very simple regex doesn't even need to contain any special syntax; literal characters will suffice. For example:

"Microsoft" -match "soft"
"Software" -match "soft"
"Computers" -match "soft"

When run in Windows PowerShell, the first two expressions return True while the third returns False. In each, a string is followed by the -match operator, which is followed by a regex. By default, a regex will float across a string to find a match; that is, it can match anywhere within the string. The characters "soft" can be found within both Software and Microsoft, but at different positions. Also notice that, by default, a regex is case-insensitive: "soft" is found in "Software" despite its capital S.

But if necessary, a different operator, -cmatch, offers case-sensitive regex comparison, like so:

"Software" -cmatch "soft"

This expression returns False since the string "soft" doesn't match "Software" in a case-sensitive comparison. Note that the -imatch operator is also available as an explicit case-insensitive option, despite that being the default behavior of -match.
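
The three operators can be compared side by side in a short sketch. The variable names are only illustrative, and the last lines go slightly beyond the text above: when the left side is a collection rather than a single string, -match returns the elements that match instead of True or False.

```powershell
# -match is case-insensitive, -cmatch is case-sensitive, -imatch is explicitly case-insensitive
$default   = "Software" -match  "soft"    # True
$sensitive = "Software" -cmatch "soft"    # False: "S" is not "s"
$exactCase = "Software" -cmatch "Soft"    # True
$explicit  = "Software" -imatch "soft"    # True

# With a collection on the left, -match returns the matching elements
$hits = @("Microsoft", "Software", "Computers") -match "soft"
$hits    # Microsoft, Software
```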

Wildcards and Repeaters

A regex can contain a few wildcard characters and modifiers. A period, for example, matches one instance of any character. A question mark makes the character that precedes it optional, so that character can appear zero times or one time. Here are some examples to illustrate:

"Don" -match "D.n" (True)
"Dn" -match "D.n" (False)
"Don" -match "D?n" (True)
"Dn" -match "D?n" (True)

In the first expression, the period stands in for exactly one character, so the match is True. In the second expression, the period doesn't find the one character that it requires, and so the match is False. In the third and fourth expressions, the question mark applies to the "D" that precedes it, making the "D" optional. In the third example, because the regex floats, the "n" at the end of "Don" matches on its own with no "D" immediately before it. In the fourth example, "D" is followed directly by "n", so the whole string matches. To allow an optional character of any kind between two letters, combine the period and the question mark, as in "D.?n".
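
One way to check what the question mark is really doing is to look at the text that was matched, not just the True or False result. The sketch below reads the automatic $matches variable (covered later in this article) after each comparison. The variable names are illustrative only.

```powershell
# The question mark applies to the element just before it (here, the D).
$null = "Don" -match "D?n"
$floated = $matches[0]      # "n": the optional D is skipped and the match floats to the n

$null = "Dn" -match "D?n"
$withD = $matches[0]        # "Dn"

# Pairing the period with the question mark allows one optional character of any kind.
$optionalAny = ("Don" -match "D.?n") -and ("Dn" -match "D.?n")    # True
$tooMany     = "Doon" -match "D.?n"                               # False: two characters sit between D and n
```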

A regex also recognizes the * and + symbols as repeaters. These need to follow a character (or a group of characters enclosed in parentheses). The * matches zero or more of the preceding character, while the + matches one or more of the preceding character. Here are some examples:

"DoDon" -match "Do*n" (True)
"Dn" -match "Do*n" (True)
"DoDon" -match "Do+n" (True)
"Dn" -match "Do+n" (False)

Notice that both * and + apply only to the "o" that immediately precedes them, not to "Do" as a whole. In the first and third examples, the match is found in the "Don" at the end of "DoDon". To repeat a series of characters, enclose the series in parentheses, as in "(Do)+n".
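
To see exactly which characters a repeater applies to, it helps to inspect the matched text in the automatic $matches variable (covered later in this article). The following is a small sketch with illustrative variable names. It contrasts a repeater that follows a single character with one that follows a group in parentheses.

```powershell
$null = "DoDon" -match "Do+n"
$plusOnO = $matches[0]         # "Don": found at the end of the string

$null = "Dooon" -match "Do+n"
$manyOs = $matches[0]          # "Dooon": the + repeats only the o

# Parentheses group characters so the repeater applies to the whole group.
$null = "DoDon" -match "(Do)+n"
$groupRepeat = $matches[0]     # "DoDon": "Do" twice, then n

$noGroupHit = "Dooon" -match "(Do)+n"    # False: "Do" is not followed by n or another "Do"
```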

What if you need to match the period, *, ?, or + symbols themselves? You simply precede them with a backslash, which is the regex escape character:

"D.n" -match "D\.n" (True)

Notice that this is different from the Windows PowerShell escape character (the backtick, or backward apostrophe), but it follows industry-standard regex syntax.
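
When the text to be matched literally comes from a variable, escaping each symbol by hand is error-prone. One option that goes beyond what this article covers is the static Escape method of the .NET [regex] type, which adds the backslashes for you. A minimal sketch, with illustrative variable names:

```powershell
# [regex]::Escape is a .NET method that backslash-escapes regex metacharacters.
$literal = "D.n+?"
$pattern = [regex]::Escape($literal)    # D\.n\+\?

$exact   = "D.n+?" -match $pattern      # True
$similar = "Don+?" -match $pattern      # False: the escaped period only matches a real period
```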

Character Classes

A character class is a broader form of wildcard, representing an entire group of characters. Windows PowerShell recognizes quite a few character classes. For instance:

  • \w matches any word character, meaning letters, numbers, and the underscore.
  • \s matches any white space character, such as tabs, spaces, and so forth.
  • \d matches any digit character.

There are also negative character classes: \W matches any non-word character, \S matches non-white space characters, and \D matches non-digits. These classes can be followed by * or + to indicate that multiple matches are acceptable. Here are some examples:

"Shell" -match "\w" (True)
"Shell" -match "\w*" (True)

Cmdlet of the Month

The Write-Debug cmdlet is very handy for writing objects (such as text strings) to the Debug pipeline. Trying this cmdlet in the shell can be somewhat disappointing, though, because it doesn't look like the cmdlet is doing anything.

The trick is that the Debug pipeline is shut off by default: the $DebugPreference variable is set to "SilentlyContinue." Set it to "Continue," however, and everything you send with Write-Debug will appear at the console in yellow text. This is a perfect way to add trace code to your scripts, allowing you to follow the execution of a complex script. The yellow color helps you distinguish between trace output and the script's normal output, and you can shut off the debug messages at any time without having to remove all the Write-Debug statements. Simply set $DebugPreference = "SilentlyContinue" again and the debug text will be suppressed.
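
As a rough sketch of this technique, the following function emits one trace message per loop iteration. Get-Total is a hypothetical function made up for this illustration. Only the Write-Debug cmdlet and the $DebugPreference variable come from the description above.

```powershell
# Get-Total is a hypothetical function used only for illustration.
function Get-Total {
    param([int[]]$Numbers)
    $sum = 0
    foreach ($n in $Numbers) {
        Write-Debug "Adding $n (running total so far: $sum)"
        $sum += $n
    }
    $sum
}

$DebugPreference = "Continue"            # show debug messages
$total = Get-Total -Numbers 1, 2, 3      # writes three DEBUG lines to the console

$DebugPreference = "SilentlyContinue"    # suppress them again
$quiet = Get-Total -Numbers 4, 5         # no debug output
```

The debug messages travel on their own stream, so they do not end up in $total or $quiet.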

Though both of the \w expressions shown earlier return True, they're matching significantly different things. Fortunately, there's a way to see what the -match operator is thinking under the hood: each time a match is made, a special variable called $matches is populated with the results of the match, that is, whatever characters in the string the operator matched against your regex. The $matches variable retains its results until another positive match is made using the -match operator. Figure 1 shows the difference between the two expressions. As you can see, \w matched the "S" in "Shell", while the repeating \w* matched the entire word.

Figure 1 What a difference a * can make
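
The comparison in Figure 1 can be reproduced with a few lines at the console. This is a minimal sketch with illustrative variable names. Entry 0 of $matches holds the full text that the regex matched.

```powershell
$null = "Shell" -match "\w"
$oneCharacter = $matches[0]     # "S"

$null = "Shell" -match "\w*"
$wholeWord = $matches[0]        # "Shell"

# A failed match leaves $matches untouched, so it still holds the previous result.
$failed = "!!!" -match "\w"     # False
$stale  = $matches[0]           # still "Shell"
```

Because of that last behavior, it is safest to read $matches only after confirming that the comparison returned True.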

Character Groups, Ranges, and Sizes

A regex can also contain groups or ranges of characters, enclosed in square brackets. For example, [aeiou] means that any one of the included characters (a, e, i, o, or u) is an acceptable match. [a-zA-Z] indicates that any letter in the range a-z or A-Z is acceptable (although if you're using the non-case-sensitive -match operator, just a-z or A-Z on its own would be sufficient). Here's an example:

"Jeff" -match "J[aeiou]ff" (True)
"Jeeeeeeeeeeff" -match "J[aeiou]ff" (False)

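A character group stands for exactly one character unless a repeater follows it. The sketch below, with illustrative variable names, adds a + after the group and also shows how the range matters once the comparison becomes case-sensitive.

```powershell
# A group followed by + accepts one or more characters from the group.
$manyVowels = "Jeeeeeeeeeeff" -match "J[aeiou]+ff"    # True
$noVowel    = "Jff" -match "J[aeiou]+ff"              # False: at least one vowel is required

# Ranges are case-sensitive only when the operator is.
$lowerOnly  = "JEFF" -cmatch "J[a-z]+"       # False: E is not in a-z
$bothCases  = "JEFF" -cmatch "J[a-zA-Z]+"    # True
$ignoreCase = "JEFF" -match  "J[a-z]+"       # True
```
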
You can also specify a minimum and maximum number of repetitions using curly braces. {3} indicates that you want exactly three of the preceding character, {3,} means that you want three or more, and {3,4} indicates that you want at least three but no more than four. This is an ideal way to create a regex for IP addresses:

"192.168.15.20" -match "\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}" (True)

This regex wants four groups of digits with one to three digits each, all of which are separated by a literal period. But consider this example:

"300.168.15.20" -match "\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}" (True)

This shows the limitations of a regex like this one. While the formatting of this string looks like an IP address, it obviously isn't a valid IP address, because no octet can exceed 255. This pattern can't determine whether data is actually valid; it can only determine whether the data looks right.
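
One way to close that gap is to let the regex check the format and then check the values in ordinary script code. The following is a sketch under a few assumptions: Test-IPv4Format is a hypothetical helper name, the parentheses capture each group of digits into $matches[1] through $matches[4], and the ^ and $ anchors (explained in the next section) tie the pattern to the start and end of the string.

```powershell
# Test-IPv4Format is a hypothetical helper, not a built-in cmdlet.
function Test-IPv4Format {
    param([string]$Address)

    # Step 1: the regex confirms the overall shape and captures the four groups.
    if ($Address -match '^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$') {
        # Step 2: script logic confirms that each captured value is 255 or less.
        foreach ($i in 1..4) {
            if ([int]$matches[$i] -gt 255) { return $false }
        }
        return $true
    }
    return $false
}

Test-IPv4Format "192.168.15.20"    # True
Test-IPv4Format "300.168.15.20"    # False: 300 is out of range
Test-IPv4Format "192.168"          # False: wrong shape
```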

Stop the Float

Troubleshooting a regex can be tricky. For example, here's a regex that tests for a UNC path in the format \\Server2\Share:

"\\Server2\Share" -match "\\\\\w+\\\w+" (True)

Here, the regex itself is difficult to read because every literal backslash I want to test for has to be escaped with a second backslash. Though this seems to work fine, it really doesn't:

"57\\Server2\Share" -match "\\\\\w+\\\w+" (True)

This second example is clearly (to you and me, at least) not a valid UNC path, but the regex gave it the all clear. Why? Remember that a regex will float by default. This regex merely looks for two backslashes, one or more letters and numbers, another backslash, and more letters and numbers. That pattern exists in the string—along with the extra digits at the start, which make it an invalid UNC. The trick is to tell the regex to start matching at the beginning of the string, without floating. I can do that like this:

"57\\Server2\Share" -match "^\\\\\w+\\\w+" (False)

The ^ character indicates that this is the location where the string begins. With that addition, the invalid UNC path fails because the regex is looking for the first two characters to be backslashes, and in this case they aren't.

Similarly, the $ symbol can be used to indicate the end of a string. This wouldn't be very useful in the case of a UNC path since a UNC path can contain additional path segments, such as \\Server2\Share\Folder\File, for example. However, I'm sure there are many cases where you would want to specify the end of a string.
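
The effect of each anchor is easiest to see by running the same strings against both versions of the pattern. This is a small sketch. The variable names are illustrative, and single quotes are used so that PowerShell does not try to interpret the $ character.

```powershell
$uncStart = '^\\\\\w+\\\w+'      # anchored at the start only
$uncWhole = '^\\\\\w+\\\w+$'     # anchored at both ends

$plainShare  = '\\Server2\Share' -match $uncStart                # True
$badPrefix   = '57\\Server2\Share' -match $uncStart              # False: must begin with two backslashes
$deeperPath  = '\\Server2\Share\Folder\File' -match $uncStart    # True: extra segments are allowed
$deeperWhole = '\\Server2\Share\Folder\File' -match $uncWhole    # False: $ requires the string to end after the share
$shareWhole  = '\\Server2\Share' -match $uncWhole                # True
```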

Help with Regular Expressions

In Windows PowerShell, the about_regular_expressions help topic provides basic syntax assistance for the regex language, but online resources can provide even more information. For instance, one of my favorite Web sites, www.RegExLib.com, offers a free library of regular expressions that have been written for various purposes and contributed to by the public. You can search on keywords, such as "e-mail" or "UNC," to quickly locate a regex that suits your need—or at least provides a good starting point. If you manage to create a great regex, you can contribute it to the library so others can make use of it, too.

I also like RegexBuddy (www.RegexBuddy.com). This is an inexpensive tool that provides a graphical regex editor. RegexBuddy makes it easier to assemble a complex regex, and this tool also makes it easier to test a regex to ensure that it is properly accepting valid strings and rejecting invalid ones. A number of other software developers have also created free, shareware, and commercial regex editors and testers that users will surely find to be useful.

Using Regular Expressions

You may be wondering why you would use a regex in real life. Imagine you're reading information from a CSV file and using the information to create new users in Active Directory®. If the CSV file is generated by someone else, you'll want to validate that the data in it looks right. A regex is perfect for this task. A simple regex like ^\w+$, for example, can confirm that first and last names don't contain any special characters or symbols (the anchors keep the regex from floating past them), while something more complicated can confirm that the e-mail addresses conform to your corporate standard. For example, you could use this:

"^[a-z]+\.[a-z]+@contoso\.com$"

This regex requires an e-mail address in the form don.jones@contoso.com, where the first and last names can only contain letters, and where they must be separated by a period. E-mail addresses, by the way, are the trickiest strings to write a regex for. If you can narrow your scope down to a specific corporate standard, you'll have an easier time of it.

Don't forget those start and end anchors (^ and $), which ensure that nothing follows contoso.com and also that nothing precedes the characters that make up the user's first name.

Actually using this regex in Windows PowerShell is pretty easy. Assuming the variable $email contains the e-mail address you read from the CSV file, something like this will check to see whether it's valid or not:

# The period in contoso\.com is escaped so that it matches only a literal period
$regex = '^[a-z]+\.[a-z]+@contoso\.com$'
if ($email -notmatch $regex) {
  Write-Error "Invalid e-mail address $email"
}

And in this example you've learned a new operator: -notmatch returns True if the string doesn't match the provided regex. (There is also a -cnotmatch operator for case-sensitive comparisons.)
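
Putting the pieces together, the CSV scenario might look something like the sketch below. Everything about the data is an assumption made for illustration: the column names First, Last, and Email, the sample rows, and the use of an in-memory string with ConvertFrom-Csv in place of a real file read with Import-Csv. The sketch escapes the period in contoso\.com so that it matches only a literal period.

```powershell
# Sample rows standing in for a CSV file; the column names are assumptions.
$csv = @'
First,Last,Email
Don,Jones,don.jones@contoso.com
John,D'oe,johnd@contoso.com
Jane,Smith,jane.smith@contosoXcom
'@

$nameRegex  = '^\w+$'
$emailRegex = '^[a-z]+\.[a-z]+@contoso\.com$'

# Collect every row in which at least one field fails its regex.
$invalid = @(
    $csv | ConvertFrom-Csv | Where-Object {
        $_.First -notmatch $nameRegex -or
        $_.Last  -notmatch $nameRegex -or
        $_.Email -notmatch $emailRegex
    }
)

foreach ($user in $invalid) {
    Write-Warning "Rejected row for $($user.First) $($user.Last): $($user.Email)"
}
```

Here the second row is rejected because of the apostrophe in the last name and the missing period in the e-mail address, and the third because "contosoXcom" does not contain a literal period.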

There's a lot more about regular expressions I haven't covered here—additional character classes, more advanced operations, and even an operator or two. And then there's the [regex] object type that Windows PowerShell supports. However, what I have covered in this quick overview of regex syntax should be enough to get you started. Feel free to visit me anytime at www.ScriptingAnswers.com if you need help puzzling through an especially tricky regex.

Don Jones is a contributing editor for TechNet Magazine and is the coauthor of Windows PowerShell: TFM (SAPIEN Press). He teaches Windows PowerShell (www.ScriptingTraining.com) and can be reached through the ScriptingAnswers.com Web site.

© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.