Hey, Scripting Guy! Raising Eyebrows on Regular Expressions

The Microsoft Scripting Guys

Download the code for this article: HeyScriptingGuy2008_05.exe (150KB)

Back in November 2007, the Scripting Guys got to spend a day in Paris while en route to the Tech•Ed IT Forum conference held in Barcelona. During our one-day layover, we took the opportunity to visit the Louvre, the city's world-famous museum of art.

So how did the Scripting Guys find the Louvre? That's easy: we walked up to Notre Dame and then took a left.

Oh, you mean did we enjoy the Louvre? For the most part, yes. The only problem we had is that the Louvre, like many museums, operates on a look-but-don't-touch basis. Everyone knows that the Mona Lisa would look way better if she had eyebrows, but, for some reason, the people who run the Louvre get very upset if you try to fix any of the paintings.

Note: actually, both of the Scripting Guys liked the Mona Lisa. That was a pleasant surprise, too; after all the hype and buildup we were afraid it might be just another painting. But it wasn't; it was very cool. (Although it could use a few eyebrows.) Interestingly, however, we were both disappointed in the equally hyped Venus de Milo. Neither of us found the workmanship to be all that spectacular, and the Scripting Guy who writes this column was perplexed by the whole idea of the Venus de Milo. A statue of a woman who has no arms? How is she supposed to mop the floor or wash the dishes!?!

Note to our female readers (assuming there are any left): obviously that was a typo. What we meant to say was this: a statue of a woman who has no arms? And yet she can still do twice the work of any man, and do the work correctly to boot.

The Scripting Guys apologize for any misunderstanding our original statement might have created.

At any rate, at the moment we set eyes on the great treasures of the Louvre, the two Scripting Guys were struck by an identical thought: where are the restrooms? While searching, the Scripting Guy who writes this column had another thought: the Scripting Guys are hypocritical. After all, we're miffed that the Louvre won't let us fix the Mona Lisa. And yet, we're guilty of a similar sin. In the January 2008 issue of TechNet Magazine, we wrote an article about using regular expressions in a script. That stands as a prime example of look-but-don't-touch: we showed you how to use regular expressions to identify problems in a text file, but we didn't show you how to fix those problems. Zut alors!

Note: so if the Scripting Guys went to the Louvre in November 2007, how could the Scripting Guy who writes this column have had a sudden thought regarding an article that didn't appear in TechNet Magazine until January 2008? Wow; that is a conundrum, isn't it? Must have something to do with time zone differences between Redmond and Paris.

Fortunately, though, and unlike the folks who run the Louvre, the Scripting Guys are willing to admit when they've made a mistake. We were wrong to show you how to only search for things using regular expressions; we should have also shown you how to replace things using regular expressions. In fact, we should have shown you a script like the one in Figure 1.

Figure 1 Search and replace

Set objRegEx = _
    CreateObject("VBScript.RegExp")

objRegEx.Global = True   
objRegEx.IgnoreCase = True
objRegEx.Pattern = "Mona Lisa"

strSearchString = _
    "The Mona Lisa is in the Louvre."
strNewString = _
    objRegEx.Replace(strSearchString, _
                     "La Gioconda")

Wscript.Echo strNewString 

Now, to tell you the truth, this is actually a pretty mundane use of regular expressions: all we're doing here is replacing all instances of the string value Mona Lisa with La Gioconda (Italian for "Now where did I put those eyebrows?"). Admittedly, we could have performed this replacement much more easily just by using the VBScript Replace function. But, fear not: we'll use this simple little script to explain how to perform search-and-replace operations using regular expressions, and then once that's done, we'll show you a few of the fancier things you can do with these expressions.

As you can see, there's really not much to this script. We begin by creating an instance of the VBScript.RegExp object; needless to say, that's the object that enables us to use regular expressions within a VBScript script. After creating the object, we then assign values to three of the object's properties:

Global By setting this property to True, we're telling the script to search for (and replace) every instance of Mona Lisa it finds in the target text. If the Global property were False (the default value), the script would search for and replace only the first instance of Mona Lisa.

IgnoreCase Setting IgnoreCase to True tells the script that we want to perform a case-insensitive search; in other words, we want to treat mona lisa and Mona Lisa as being identical. By default, VBScript does a case-sensitive search, meaning that—thanks to the uppercase and lowercase letters—mona lisa and Mona Lisa are seen as being completely different values.

Pattern The Pattern property holds the value that we're looking for. In this case, we're just looking for a simple string value: Mona Lisa.
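If you'd like to see what Global and IgnoreCase actually do, here's a minimal sketch that runs the same pattern against a sample string and switches the properties on one at a time. The sample string is our own invention for illustration; it isn't part of Figure 1.

```vbscript
' Sketch: how Global and IgnoreCase change what Replace does.
' strText is a made-up sample string, not taken from Figure 1.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "Mona Lisa"

strText = "mona lisa, Mona Lisa, MONA LISA"

' Defaults: Global = False, IgnoreCase = False.
' Only the first case-sensitive match is replaced.
Wscript.Echo objRegEx.Replace(strText, "La Gioconda")
' mona lisa, La Gioconda, MONA LISA

' Case-insensitive, but still only the first match.
objRegEx.IgnoreCase = True
Wscript.Echo objRegEx.Replace(strText, "La Gioconda")
' La Gioconda, Mona Lisa, MONA LISA

' Case-insensitive and global: every match is replaced.
objRegEx.Global = True
Wscript.Echo objRegEx.Replace(strText, "La Gioconda")
' La Gioconda, La Gioconda, La Gioconda
```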

Next, we assign the text we want to search to a variable named strSearchString:

strSearchString = "The Mona Lisa is in the Louvre."

Then we call the regular expression method Replace, passing this method two parameters: the target text we want to search (the variable strSearchString) and the replacement text (La Gioconda). That's what we're doing here:

strNewString = objRegEx.Replace(strSearchString, "La Gioconda")

And that's it. The modified text gets stored in the variable strNewString. If we now echo back the value of strNewString, we should get the following:

The La Gioconda is in the Louvre.

The grammar might be a little questionable, but you get the general idea.

As we noted earlier, that's all well and good, but it's definitely overkill; we could have accomplished the exact same thing by using these lines of code (in fact, we could even do it all in one line if we wanted to):

strSearchString = "The Mona Lisa is in the Louvre."
strNewString = Replace(strSearchString, "Mona Lisa", "La Gioconda")
Wscript.Echo strNewString

So let's see if we can do something interesting with regular expressions, something we can't do with the VBScript Replace function.

No one has any ideas, huh? Well, here's one. Often we Scripting Guys end up having to copy text from one type of document to another. Sometimes that works pretty well; sometimes it doesn't. When it doesn't work, we often get odd problems with word spacing, problems that result in text that looks like the text below:

Myer Ken,       Vice President,  Sales and      Services

Yikes; just look at all those extraneous blank spaces! And, in this case, the Replace function is of limited use. Why? Well, we have a seemingly random number of extraneous blank spaces: there might be 7 blank spaces between words, there might be 2 blank spaces between words, or there might be 6 blank spaces between words. That makes it difficult to fix the problem using Replace. For example, if we try searching for 2 consecutive blank spaces (replacing those 2 with a single blank space) we end up with this:

Myer Ken,    Vice President, Sales and   Services

That's better, but not by much. There is a way we could do this, but it requires us to search for an arbitrary number of blank spaces (say, 39); make the replacement; subtract 1 from the starting number; search for 38 blank spaces; make the replacement; and so forth, and so on. Alternatively, we could use this far simpler (and far more foolproof) regular expressions script:

Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True
objRegEx.Pattern = " {2,}"

strSearchString = _
    "Myer Ken,       Vice President,  Sales and      Services"
strNewString = objRegEx.Replace(strSearchString, " ")

Wscript.Echo strNewString

The key to this script (and the key to most regular expression scripts) is the Pattern:

objRegEx.Pattern = " {2,}"

What we're doing here is looking for 2 (or more) consecutive blank spaces. How do we know that this Pattern looks for 2 (or more) blank spaces? Well, inside our double quote marks we have a single blank space followed by this construction: {2,}. In regular expressions syntax, that says look for at least 2 consecutive instances of the preceding character (in this case, a blank space). And what if there are 3 or 4 or 937 consecutive blank spaces? That's fine; the pattern would grab all of those as well. (If, for some reason, we wanted to grab at least 2 blank spaces but no more than 8, then we'd use the syntax {2,8}, with the 8 specifying the maximum number of matches.)

In other words, any time we find 2 or more blank spaces, one right after another, we're going to grab all those consecutive spaces and replace them with a single blank space. What will that do to our original string value, the one with all the extraneous blank spaces? This:

Myer Ken, Vice President, Sales and Services

See? The Scripting Guys really can make things better. Now if only the folks at the Louvre would give us a crack at the Mona Lisa.
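For the curious, here's a rough sketch of how the {2,} and {2,8} forms differ in practice. The test string and the underscore replacement are made up for illustration (an underscore is simply easier to see than a blank space); Space is the built-in VBScript function that returns the specified number of blank spaces.

```vbscript
' Sketch: open-ended versus bounded quantifiers.
' The sample string and the "_" replacement are illustrative only.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True

' "a", 3 spaces, "b", 10 spaces, "c"
strText = "a" & Space(3) & "b" & Space(10) & "c"

' {2,}: each run of 2 or more spaces counts as a single match.
objRegEx.Pattern = " {2,}"
Wscript.Echo objRegEx.Replace(strText, "_")
' a_b_c

' {2,8}: a match stops after 8 spaces, so the run of 10 is
' consumed as one match of 8 followed by a second match of 2.
objRegEx.Pattern = " {2,8}"
Wscript.Echo objRegEx.Replace(strText, "_")
' a_b__c
```

That second result is why the open-ended {2,} is usually the safer choice for cleanup jobs like this one.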

Here's an interesting—and not uncommon—scenario. Suppose your company has a telephone directory, and all the phone numbers are formatted like this:

555-123-4567

Now, however, your boss has decided that all the phone numbers should be formatted to look like this:

(555) 123-4567

How in the world are you supposed to reformat those phone numbers? Well, if we might be so bold, may we suggest that you use a script similar to this one:

Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True
objRegEx.Pattern = "(\d{3})-(\d{3})-(\d{4})"

strSearchString = "555-123-4567"
strNewString = objRegEx.Replace _
(strSearchString, "($1) $2-$3")

Wscript.Echo strNewString

What we're doing here is searching for 3 digits (\d{3}) followed by a dash, followed by 3 more digits and a dash, followed by 4 digits. In other words, we're searching for the following, where each X represents a number from 0 to 9:

XXX-XXX-XXXX

Note: so how did we know that \d{3} would tell the script to look for three numbers back-to-back-to-back? Well, as best we can recall, we read it somewhere. In fact, it had to be either the shocking final chapter of The Da Vinci Code or the VBScript Language Reference on MSDN® online (see go.microsoft.com/fwlink/?LinkID=111387).

Now, it's pretty cool that we can search for an arbitrary phone number using regular expressions. We still have a major problem here, however. After all, we can't replace that arbitrary phone number with an equally arbitrary phone number; instead, we have to use the exact same phone number, just formatted a little bit differently. How in the world do we do that?

Why, by using the following replacement text:

"($1) $2-$3"

$1, $2, and $3 are examples of a regular expression "back reference." A back reference is simply a portion of the found text that can be saved and then reused. In this particular script, we're looking for three "sub-matches":

  • A set of 3 digits
  • A set of 3 more digits
  • A set of 4 digits

Each of these sub-matches is automatically assigned a back reference: the first sub-match is $1; the second is $2; and so on, all the way through $9. In other words, in this script the three parts of our phone number are automatically assigned the back references shown in Figure 2.

Figure 2 Phone number back references

Phone Number Portion    Back Reference
555                     $1
123                     $2
4567                    $3

In our replacement string, we use these back references to ensure that the correct phone number is reused. Our replacement text simply says this: take the first back reference ($1) and enclose it in parentheses. Leave a space, then insert the second back reference ($2) followed by a dash. Finally, tack on the third back reference ($3).

What will all that give us? That will give us a phone number that looks like this:

(555) 123-4567

Not bad, not bad at all.
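If you'd like to peek at the sub-matches themselves, one way to do it (a sketch, and certainly not the only approach) is to call the Execute method and read the SubMatches collection of the Match object it returns. Keep in mind that SubMatches is zero-based, so item 0 corresponds to $1. The variable names colMatches and objMatch are ours.

```vbscript
' Sketch: list the sub-matches captured by the phone number pattern.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "(\d{3})-(\d{3})-(\d{4})"

Set colMatches = objRegEx.Execute("555-123-4567")

If colMatches.Count > 0 Then
    Set objMatch = colMatches(0)
    ' SubMatches is zero-based: index 0 corresponds to $1.
    For i = 0 To objMatch.SubMatches.Count - 1
        Wscript.Echo "$" & (i + 1) & " = " & objMatch.SubMatches(i)
    Next
End If
' $1 = 555
' $2 = 123
' $3 = 4567
```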

Here's a variation on the phone number script. Suppose your organization has installed a brand new phone system and, as part of the changeover, all your phone numbers will now have the same prefix (the middle three digits); where the prefix might originally have been 666, 777, or 888, every prefix will now be 333. Can we reformat the phone numbers and change the phone number prefix, all at the same time? Why, of course we can:

Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True
objRegEx.Pattern = "(\d{3})-(\d{3})-(\d{4})"

strSearchString = "555-123-4567"
strNewString = objRegEx.Replace _
(strSearchString,"($1) 333-$3")

Wscript.Echo strNewString

See what we did here? We simply removed the old prefix (back reference $2) in our replacement text; in its place we substituted the hardcoded, and now standard, prefix value 333. What will the phone number 555-123-4567 look like after we run this modified script? It should look a whole heck of a lot like this:

(555) 333-4567
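Because the Global property is True, the same script should work just as well on text that contains more than one phone number. Here's a minimal sketch; the directory string (including the second name and both numbers) is invented for illustration.

```vbscript
' Sketch: reformat every phone number in a longer string.
' strDirectory is hypothetical sample data.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True
objRegEx.Pattern = "(\d{3})-(\d{3})-(\d{4})"

strDirectory = "Ken Myer: 555-666-1234; Pilar Ackerman: 555-777-5678"

' $1 and $3 are taken from each match in turn; the prefix is hardcoded.
Wscript.Echo objRegEx.Replace(strDirectory, "($1) 333-$3")
' Ken Myer: (555) 333-1234; Pilar Ackerman: (555) 333-5678
```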

Here's another common use for back referencing. Suppose we have a string value that looks like this:

Myer, Ken

Is there any way to flip that value around and display the name like this:

Ken Myer

Well, we'd look pretty silly if there wasn't, wouldn't we? Here's a script that does that very thing:

Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True
objRegEx.Pattern = "(\S+), (\S+)"

strSearchString = "Myer, Ken"
strNewString = objRegEx.Replace _
    (strSearchString, "$2 $1")

Wscript.Echo strNewString

In this particular script, we're looking for a word—(\S+)—followed by a comma, followed by a blank space, followed by another word. (In this case, we're using \S+ to represent a "word.") The construction \S+ means any consecutive set of non-white-space characters. In other words, we could have a letter, a number, a symbol; in fact, we could have just about anything other than a space, a tab, or a carriage return-linefeed. As you can see, we expect to find two sub-matches here: one representing the last name ($1) and one representing the first name ($2). Because of that, we can display the user name as FirstName LastName by using this syntax:

"$2 $1"

Where's the comma? Well, obviously we didn't need it, so we simply threw it out.

Note: that's funny; for some reason we started thinking about the Scripting Editor, too. Hmmm ...
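One caveat worth sketching: because \S+ stops at the first blank space, the pattern above only flips single-word names. The second sample name below is made up, and the anchored pattern "^(.+), (.+)$" is our own assumption about one possible workaround (it assumes the string contains nothing but the name, with a single comma); it is not part of the original script.

```vbscript
' Sketch: where "(\S+), (\S+)" falls short, plus one possible alternative.
Set objRegEx = CreateObject("VBScript.RegExp")

' Hypothetical sample names.
arrNames = Array("Myer, Ken", "de la Cruz, Maria")

' The pattern from the script above: only the word that touches
' the comma is captured as the last name.
objRegEx.Pattern = "(\S+), (\S+)"
For Each strName In arrNames
    Wscript.Echo objRegEx.Replace(strName, "$2 $1")
Next
' Ken Myer
' de la Maria Cruz

' Assumed alternative: match the whole string and allow spaces
' on either side of the comma.
objRegEx.Pattern = "^(.+), (.+)$"
For Each strName In arrNames
    Wscript.Echo objRegEx.Replace(strName, "$2 $1")
Next
' Ken Myer
' Maria de la Cruz
```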

Let's show you one more before we call it a day. (Well, OK, before we call it a month.) This is not 100 percent foolproof; after all, we don't want an introductory article like this one to get too complicated. (And regular expressions do have the potential to get extremely complicated.) Nevertheless, here's a script that—in most cases—will remove the leading zeroes from a value like 0000.34500044:

Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True
objRegEx.Pattern = "\b0{1,}\."

strSearchString = _
"The final value was 0000.34500044."
strNewString = objRegEx.Replace _
    (strSearchString, ".")

Wscript.Echo strNewString

As usual, the only reason this works is because of the pattern: "\b0{1,}\." We start off looking for a word boundary (\b); that ensures that we don't remove the zeroes in a value like 100.546. We then look for one or more zeroes—0{1,}—followed by a decimal point (\.). If the pattern is found, we replace those zeroes (and the decimal point) with a single decimal point ".". If all goes according to plan, that's going to transform our string into this:

The final value was .34500044.
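To get a feel for what "in most cases" means, here's a small sketch that runs the same pattern over a handful of values. The sample values are ours; the last one (a version-style number) is there to show one way the pattern can misfire.

```vbscript
' Sketch: try the leading-zero pattern on several made-up values.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True
objRegEx.Pattern = "\b0{1,}\."

arrValues = Array("0000.345", "100.546", "0.5", "1.0.2")

For Each strValue In arrValues
    Wscript.Echo strValue & " -> " & objRegEx.Replace(strValue, ".")
Next
' 0000.345 -> .345
' 100.546 -> 100.546
' 0.5 -> .5
' 1.0.2 -> 1..2
```

In the last value, the zero sits right after a period, which counts as a word boundary, so the pattern strips it and leaves two periods in a row.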

That's about all the time we have for this month. Before we go, we might note that, since before the paint even dried, the Mona Lisa has been the subject of considerable controversy. Who is the mysterious woman? Why is she smiling like that? Why doesn't she have eyebrows? Several art historians have suggested that Mona Lisa isn't even a woman, that the painting is—instead—a self-portrait of Leonardo da Vinci. (If that's true, he could really stand to find a new tailor.) Meanwhile, the Unarius Educational Foundation has gone a step further, claiming that the painting is actually that of Leonardo's "twin soul" in the "higher world," and that this twin soul guided Leonardo's hand. By remarkable coincidence, that's exactly how this month's Hey, Scripting Guy! was written.

Which means that all complaints should be addressed to the-twin-soul-of-the-scripting-guy-who-writes-that-column@the-other-microsoft.com. Thank you.

Dr. Scripto's Scripting Perplexer

The monthly challenge that tests not only your puzzle-solving skills, but also your scripting skills.

May 2008: Script-doku

This month, we're playing Sudoku, but with a bit of a twist. Instead of the numbers 1 through 9 in the grid, we have the letters and symbols that make up a Windows PowerShell™ cmdlet. In the final solution, one of the rows read across will spell out the cmdlet name.

Note: if you don't already know how to play Sudoku, there are probably several thousand Web sites on the Internet to get the instructions, so we're not going to repeat them here, sorry.

The Microsoft Scripting Guys work for—well, are employed by—Microsoft. When not playing/coaching/watching baseball (and various other activities) they run the TechNet Script Center. Check it out at www.scriptingguys.com.

© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.