Thursday, January 7, 2016

Demystifying Regex Redux with 9 simple terms

Can I have your number?
When I wrote Demystifying Regex with 7 Simple Terms a while ago, I left out a couple of really useful regex terms. So I guess it has to be rewritten as 9 Simple Terms.

Regex is one of those things that when you need it, you need it. But it dates back decades and is cryptic. Most regexes you see are too complex and hard to follow. In this post, I'll show you a couple more terms to help you keep your regex simple and to a minimum, while still letting you tap its power.

expect-lite has very good support for regex meta characters (the ones that start with a backslash "\"). As a quick review, the 7 terms are:
  • Repeats: * and +
  • Meta characters: \d, \w, \n, \t
  • Or: |
But there are a couple more regex meta characters which I have found useful in addition to the 7 above, especially when skipping over columnar info to get to the column you want to validate (or capture into a variable):
  • \s is whitespace (space, tab, or newline)
  • \S is not whitespace
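
As a quick illustration of the pair, here is a minimal sketch in Python rather than expect-lite (Python's re module is only a stand-in, and the sample line is made up for the example). \S+ grabs one column and \s+ skips the gap to the next:

```python
import re

# Hypothetical line of columnar output, invented for this sketch.
line = "UG    100    0        0 eth0"

# \S+ matches a run of non-whitespace (one column);
# \s+ skips the whitespace between columns.
match = re.search(r"(\S+)\s+(\S+)", line)
flags, metric = match.group(1), match.group(2)
print(flags, metric)
```

The same idea scales to any column: repeat \s+\S+ once for each column you want to step over.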

Working with the example from demystifying regex with 7 simple terms:
$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
                                                U     0      0        0 eth0
                                                U     1000   0        0 eth0
                                                UG    100    0        0 eth0

You could match the first IP address:
>route -n

But what if you wanted to match the default route metric (on the last line) rather than the first IP address? You could use (I have highlighted the value we are searching for in bold):
#validate metric
>route -n
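
To give a feel for the column-by-column approach, here is a rough Python sketch. It is not the expect-lite validation line itself: the pattern is my own guess at the general shape, and the addresses in the sample output are assumed purely for illustration.

```python
import re

# Hypothetical `route -n` output; the addresses are assumed for illustration.
ROUTE_OUTPUT = """Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 eth0
0.0.0.0         192.168.1.1     0.0.0.0         UG    100    0        0 eth0
"""

# Describe every column of the default route from left to right,
# using \s+ for each gap, and capture the metric at the end.
pattern = r"0\.0\.0\.0\s+\d+\.\d+\.\d+\.\d+\s+0\.0\.0\.0\s+UG\s+(\d+)"
match = re.search(pattern, ROUTE_OUTPUT)
metric = match.group(1) if match else None
print(metric)
```

It works, but the pattern is long for the single value it captures.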

This introduces a new meta character, \s, which matches a space (or any whitespace). But it starts to look complex. To keep it simple, use the non-space \S and backtrack from a known point (in this example 'eth0'), which simplifies the regex somewhat:
#validate metric
>route -n
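
Here is one way the "work backwards from eth0" idea might look, again sketched in Python with assumed addresses and a pattern of my own. Because every row ends in eth0, the sketch collects the metric from each row and takes the last one as the default route:

```python
import re

# Hypothetical `route -n` output; the addresses are assumed for illustration.
ROUTE_OUTPUT = """Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 eth0
0.0.0.0         192.168.1.1     0.0.0.0         UG    100    0        0 eth0
"""

# Counting back from eth0: Use, Ref, then Metric.
# \S+ steps over a column without caring what is in it.
pattern = r"(\d+)\s+\S+\s+\S+\s+eth0"
metrics = re.findall(pattern, ROUTE_OUTPUT)

# The default route is on the last line, so its metric is the last capture.
default_metric = metrics[-1]
print(metrics, default_metric)
```

Having to pick the last match is the price of anchoring on a value that appears on every row.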

That is better, but it could be simpler still by keying off the flags column.
#validate metric of default route
>route -n

This uses the space meta-character, \s, and another meta-character mentioned in the fine print of the original post: the dot, or '.', which matches any character. It is good to use the dot sparingly, as it can often match more than you would expect. But in this example, because there is only one default route (it is, after all, only IPv4*), and it is always on the last line, it is pretty safe to use the dot.
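
A small Python sketch of keying off the flags column, and of how greedy the dot can be. The sample addresses and both patterns are assumptions for illustration. re.DOTALL lets the dot cross newlines, which is my understanding of how the Tcl regex engine underneath expect-lite treats the dot by default:

```python
import re

# Hypothetical `route -n` output; the addresses are assumed for illustration.
ROUTE_OUTPUT = """Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 eth0
0.0.0.0         192.168.1.1     0.0.0.0         UG    100    0        0 eth0
"""

# "UG" only appears on the default route, so it is a safe starting point.
match = re.search(r"UG\s+(\d+).+eth0", ROUTE_OUTPUT, re.DOTALL)
metric = match.group(1)

# A looser starting point shows the danger: "U" first appears in the
# header word "Use", and the greedy dot then runs to the last eth0.
too_much = re.search(r"U.+eth0", ROUTE_OUTPUT, re.DOTALL).group(0)
print(metric, too_much.count("eth0"))
```

The second pattern swallows all three rows, which is exactly the "matches more than you would expect" problem.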

To see just what expect-lite did match, use the *EXP_INFO directive on the CLI or earlier in your script.

Regex Guidelines

It is a good idea to keep regex as simple as possible. As we saw above, it is easy to create complex regex, but that leads to challenges in maintaining code later. Every time someone has to debug the script, they have to figure out what the regex is doing. Shorter, simpler regex will always win out.

Regex has the concept of anchors (^,$), but I haven't included them in the 9 simple terms for a couple of reasons:

  • Anchors don't work as you would expect in expect-lite. One would expect to be able to anchor at the beginning of a line, but expect-lite doesn't evaluate output line by line; it evaluates a blob of text which includes newlines. Therefore, if you need to "anchor" your regex, do something like '\n169.254.0.0'. Regexes with anchors also tend not to be simple or short.
  • I have seen regexes where the entire line is described, from the beginning of the line to the end, with anchors at each end. This almost always makes for a very complex and brittle regex. A change in the column width can break these kinds of regexes. It is much easier, and less code intensive, to do a sparse validation of output using simple regexes (as shown in the example above).
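
The blob-of-text behaviour is easy to reproduce outside expect-lite. In this Python sketch (a stand-in, with assumed sample addresses), ^ only matches at the very start of the whole blob unless you ask for multi-line mode, while a leading \n does the job of a start-of-line anchor:

```python
import re

# Hypothetical `route -n` output; the addresses are assumed for illustration.
ROUTE_OUTPUT = """Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 eth0
0.0.0.0         192.168.1.1     0.0.0.0         UG    100    0        0 eth0
"""

# ^ anchors to the start of the blob, not the start of each line,
# so this finds nothing even though a line starts with 169.254.0.0.
anchored = re.search(r"^169\.254\.0\.0", ROUTE_OUTPUT)

# A leading \n stands in for "start of line" inside the blob.
newline_anchored = re.search(r"\n169\.254\.0\.0", ROUTE_OUTPUT)

print(anchored, newline_anchored is not None)
```

The dots are escaped here so they only match literal dots; unescaped, each one would match any character.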

Not everyone is a regex expert. Plan on helping the next person who looks at your script by writing a comment about what the regex is doing. And if you are lucky enough to be the next person to look at your code, then you will be thankful that you wrote your future-self a note.

Recap the 9 terms

To recap, and give you a single place to look for a reference, the 9 terms are the single character meta-characters:

  • \d  is a digit (0-9)
  • \w  is a word character (letter, digit, or underscore)
  • \n  is a new line (the line feed that ends each line of output)
  • \t  is a tab
  • \s is white space (including \t and \n)
  • \S is a non-space (any letter, number, symbol)
  • . is any character (use this sparingly**)

And the repeat characters which are modifiers to the terms above:

  • *  repeats 0 or more times
  • +  repeats 1 or more times

And the regex OR term, |
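
If you want to try the 9 terms out, here is a small Python cheat sheet. The helper name first and all the sample strings are made up for this sketch, and Python's re module stands in for expect-lite:

```python
import re

def first(pattern, text):
    """Return the first text matched by pattern, or None (hypothetical helper)."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

digits = first(r"\d+", "Metric 1000")        # \d  a digit
word = first(r"\w+", "eth0: up")             # \w  a word character
newline = first(r"\n", "line1\nline2")       # \n  a new line
tab = first(r"\t", "a\tb")                   # \t  a tab
gap = first(r"\s+", "UG    100")             # \s  whitespace
column = first(r"\S+", "  eth0  ")           # \S  non-whitespace
anything = first(r".", "x")                  # .   any character
zero_or_more = first(r"ab*", "a")            # *   0 or more of the b
one_or_more = first(r"ab+", "a")             # +   needs at least one b
either = first(r"eth0|eth1", "link eth1")    # |   one or the other

print(digits, word, column, zero_or_more, one_or_more, either)
```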

The power of 9

You can still use only the original 7 regex terms and accomplish 90% of what you need. The additional 2 meta-characters just give you a bit more control over matching. And for those of us with a finite memory, it is still fewer than the fingers on two hands.

* IPv6 can often have multiple default routes, and the metric becomes very important in determining which one is used.
** the regex dot is extra credit
