Windows PowerShell

The Case for Regular Expressions

Don Jones

In the past, I’ve written about regular expressions in Windows PowerShell, mostly from the perspective of how they work and how you can use them. This month, I’m focusing on a real-world, practical application of regular expressions in the shell. Based on a customer solution I created, this is a great example of regular expressions’ power.

A Regular Problem

The problem went something like this: I needed to use Windows PowerShell to retrieve the text of a Web page. Keep in mind that the Web page transfers as a simple text document with HTML instructions on how the page should be rendered in a Web browser. From that text, I needed to extract all hyperlinks, display them as a list and either output them to a local text file or save them in some other fashion. In HTML, a hyperlink is indicated by the <a> tag, and might look like this:

<a href="https://concentratedtech.com">Click here to visit</a>

A difficulty arises in that the <a> tag supports a number of optional attributes, such as target, which controls where the link opens (in a new window, for example). Sometimes, the <a> tag can exist without the href attribute, instead including a name attribute that establishes an in-page anchor. I specifically didn’t want to capture those; I only wanted bona fide outbound links to a different page.
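The first step, retrieving the page text, can be sketched as follows. This is a minimal sketch, not the customer solution itself: Get-PageText is a hypothetical helper name, and it assumes the System.Net.WebClient class from the .NET Framework and the try/finally syntax introduced in PowerShell v2.

```powershell
# Get-PageText is a hypothetical helper name, not a built-in cmdlet.
function Get-PageText {
    param([string]$Url)

    # System.Net.WebClient ships with the .NET Framework, so no add-ons are needed.
    $client = New-Object System.Net.WebClient
    try {
        # DownloadString returns the whole response body as a single string.
        $client.DownloadString($Url)
    }
    finally {
        # Release the underlying network resources.
        $client.Dispose()
    }
}

# Example call (requires network access):
# $html = Get-PageText 'https://concentratedtech.com'
```

The function returns the raw HTML as one string, which is the form the rest of this column works with.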

While I’ll continue to cover features and techniques available in PowerShell v1 in my column, more and more I’ll focus on features unique to v2. PowerShell v2 ships with, and is preinstalled in, Windows 7 and Windows Server 2008 R2. By the time you read this, or soon after, PowerShell v2 should be available for Windows Vista, Windows Server 2008, Windows XP and Windows Server 2003. Visit Microsoft.com/PowerShell to check for availability and download links.

A Regular Puzzle

Working with regular expressions can be a lot like solving a puzzle. Just as if you were staring at one of those three-dimensional posters, squinting helps. You have to view the information in the form of a pattern, rather than as individual characters. Squinting can help blur the characters so you can focus on the larger pattern. Consider these four hyperlinks:

<a href="https://concentratedtech.com">Click here to visit</a>

<a name="data">Data Sheet</a>

<a target="_blank" href="https://microsoft.com">Microsoft</a>

<a href="search.aspx" target="_top">Search</a>

I want to capture links that have these common elements:

  • They all start with <a
  • They all end with </a>
  • They all contain href=" somewhere after the <a and before the >

I don’t want to capture a link that doesn’t contain all of these elements. Ignoring the bits I don’t care about, and squinting, the links look like this:

<a_href="xxxxxxxxx">xxxxxxx</a>

<a_xxxxxxxx>xxxxxxx</a>

<a_xxxxxxxxxx_href="xxxxxxxxx">xxxxxxxx</a>

<a_href="xxxxxxxxx"_xxxxxxxxxx>xxxxxxxx</a>

Notice that I’ve replaced the space character with an underscore just to make it stand out a bit more, and replaced the stuff I don’t care about with “x.” Suddenly, these start to look a lot more similar, and a pattern emerges.
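The three shared elements can be checked one at a time with ordinary string methods, before any regular expression is involved. This is only a sketch to make the squinting concrete. The variable names are illustrative, and the first link is assumed to be the concentratedtech.com example written out as a full <a> tag.

```powershell
# The four sample links (the first written out as a full <a> tag).
$links = @(
    '<a href="https://concentratedtech.com">Click here to visit</a>',
    '<a name="data">Data Sheet</a>',
    '<a target="_blank" href="https://microsoft.com">Microsoft</a>',
    '<a href="search.aspx" target="_top">Search</a>'
)

# $wanted is a hypothetical name for the links that pass all three checks.
$wanted = @()
foreach ($link in $links) {
    # Everything before the first > is the opening tag.
    $openTag = $link.Substring(0, $link.IndexOf('>'))

    $startsWithA = $link.StartsWith('<a')        # starts with <a
    $endsWithA   = $link.EndsWith('</a>')        # ends with </a>
    $hasHref     = $openTag.Contains('href="')   # href=" sits between <a and >

    if ($startsWithA -and $endsWithA -and $hasHref) {
        $wanted += $link
    }
}

# Three links remain; the name="data" anchor is left out.
$wanted
```

This works only because each sample is already isolated in its own string. Finding the links inside a full page of text is what the regular expression is for.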

A Regular Pattern

Patterns are the whole point of regular expressions. Using the regular expression language, you describe the pattern of text for which you’re looking. You can get a pretty comprehensive description of that language by running Help about_regular_expressions in Windows PowerShell.

In my case, my regular expression might look something like this:

(<a\s.{0,}?href=".+?".{0,}?>.+?</a>)

I know—it’s crazy. Getting it right took me about an hour, with a huge amount of help from the Web site RegExTester.com. Let me break it down:

  • The (parentheses) define a pattern for a single hyperlink, and tell the shell to capture the matches (more on this later).
  • The <a is a literal match; the shell will look for those two characters.
  • \s means to match a single whitespace character, like a space, which always should follow <a in HTML.
  • .{0,}? means I want to see zero or more characters of any kind. The period means “any character,” and the {0,} means zero or more. The trailing question mark is special. It makes the match non-greedy (more on that in a second).
  • Next, I want to see the literal characters href="
  • Next, I want to see one or more of any character. The plus sign means “one or more,” and, once again, a trailing question mark makes it non-greedy.
  • Following that, I want to see a closing quotation mark.
  • Then I want to see zero or more characters and the closing >.
  • Lastly, I want to see one or more characters before the </a>. Here, too, the question mark after the plus sign makes this a non-greedy match.
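The breakdown above can be mirrored in code by assembling the pattern from its pieces. This is an illustrative sketch: the $pieces and $pattern names and the two result variables are assumptions made for the example, and the -join operator requires PowerShell v2.

```powershell
# Each piece of the pattern, in the order described above.
# Note that .{0,}? is equivalent to the more commonly seen .*?
$pieces = @(
    '<a',       # literal <a
    '\s',       # one whitespace character
    '.{0,}?',   # zero or more of any character, non-greedy
    'href="',   # literal href="
    '.+?',      # one or more of any character, non-greedy
    '"',        # closing quotation mark
    '.{0,}?',   # zero or more of any character, non-greedy
    '>',        # end of the opening tag
    '.+?',      # the link text, non-greedy
    '</a>'      # literal closing tag
)

# Wrap the joined pieces in parentheses to form the capturing group.
$pattern = '(' + ($pieces -join '') + ')'
$pattern

# A link with href matches; an in-page anchor without href does not.
$linkResult   = '<a target="_blank" href="https://microsoft.com">Microsoft</a>' -match $pattern
$anchorResult = '<a name="data">Data Sheet</a>' -match $pattern
$linkResult     # True
$anchorResult   # False
```

Building the pattern this way is purely for readability. The joined string is identical to the one-line pattern shown earlier.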

This non-greedy business is tricky. Let’s say I have the following text string in my HTML page:

This is a <a href="test">link</a> and this is a <a href="test">link</a> but this is not.

And let’s say I’m using this regular expression, which does not use the question marks to create non-greedy matches:

(<a\s.{0,}href=".+".{0,}>.+</a>)

The shell will match the first <a and the space after it, and then look for zero or more characters, stopping only when it reaches the last </a>. The text it matches runs from the start of the first link all the way through the end of the second:

<a href="test">link</a> and this is a <a href="test">link</a>

That’s because the first .{0,} match is greedy: The shell consumes as many characters as possible while making the regular expression work. By making it non-greedy, I tell it to consume as few characters as possible while making the regular expression work, so the match stops at the end of the first link:

<a href="test">link</a>
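The difference is easy to see by running both patterns against the same string. The sketch below assumes the .NET [regex]::Match method, which returns the first match, rather than the -match operator. The variable names are illustrative.

```powershell
$text = 'This is a <a href="test">link</a> and this is a <a href="test">link</a> but this is not.'

$greedy    = '(<a\s.{0,}href=".+".{0,}>.+</a>)'
$nonGreedy = '(<a\s.{0,}?href=".+?".{0,}?>.+?</a>)'

# [regex]::Match returns the first match; .Value is the matched text.
$greedyMatch    = [regex]::Match($text, $greedy).Value
$nonGreedyMatch = [regex]::Match($text, $nonGreedy).Value

$greedyMatch
# <a href="test">link</a> and this is a <a href="test">link</a>

$nonGreedyMatch
# <a href="test">link</a>
```

The greedy version swallows both links and the text between them as a single match. The non-greedy version stops at the first closing </a>.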

A Regular Solution

To put this to work in the shell, I started by defining a test variable and filling it with HTML. I then used the -match operator to match the test HTML against my regular expression.

PS C:\> $html = 'This is a <a href="test">test</a> but <a name="anchor">This</a> is not and <a target="_top" href="something">definitely is</a> a link.'

PS C:\> $html -match '(<a\s.{0,}?href=".+?".{0,}?>.+?</a>)'

True

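The “True” result simply tells me that the shell found a match; it doesn’t tell me what it matched. However, after using the -match operator, the shell automatically populates a hashtable called $matches. When the input is a single string, -match stops at the first match, so $matches describes only that one: key 0 holds the text matched by the entire pattern, and key 1 holds the text captured by the parentheses: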

PS C:\> $matches

Name                           Value
----                           -----
1                              <a href="test">test</a>
0                              <a href="test">test</a>

I can access individual entries by using normal index syntax:

PS C:\> $matches[0]

<a href="test">test</a>

I can easily send those to a file, too:

PS C:\> $matches | out-file c:\matches.txt
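Because -match reports only the first match in a string, collecting every link takes one more step. The sketch below assumes the .NET [regex]::Matches method. It also assumes a tightened variant of the pattern: with the original, the non-greedy .{0,}? can still run past the > of the name="anchor" tag to reach the href in the next link, so the second match would include the unwanted anchor. The output path is hypothetical.

```powershell
$html = 'This is a <a href="test">test</a> but <a name="anchor">This</a> is not and <a target="_top" href="something">definitely is</a> a link.'

# Assumption: a tightened variant of the column's pattern.
# [^>] keeps the search for href inside a single opening tag,
# and [^"] keeps the URL inside its quotation marks.
$pattern = '<a\s[^>]{0,}?href="[^"]+?"[^>]{0,}?>.+?</a>'

# [regex]::Matches returns every match, not just the first one.
# IgnoreCase mirrors the case-insensitive behavior of -match.
$links = @([regex]::Matches($html, $pattern, 'IgnoreCase') | ForEach-Object { $_.Value })
$links
# <a href="test">test</a>
# <a target="_top" href="something">definitely is</a>

# Hypothetical output path in the temp folder; substitute your own.
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'matches.txt'
$links | Out-File $outFile
```

Each element of $links is one complete hyperlink, so the file ends up with one link per line.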

 

Once you’ve mastered the syntax, as bizarre as it is, regular expressions provide a powerful and valuable way of matching even complex, variable patterns in a large body of text. You can extract the text that matches your pattern and work with it independently of the main body of text. This is especially useful for parsing log files, HTML files or any other kind of semi-structured text.

 

Don Jones is a contributing editor for TechNet Magazine, and publishes Windows PowerShell tips and news at www.ConcentratedTech.com. He is the co-author of “Windows PowerShell: TFM” (Sapien Press, 2009), now in its third edition and covering Windows PowerShell version 2.