Friday, September 14, 2012

Demystify Regex with 7 simple terms

Regular Expressions aren't for everyone. Regex is a powerful, cryptic conglomeration of characters which mean something, if only you could figure out what.

But what if there were a handful of regex terms which did 90% of what you needed? Then you could harness the immense power of regex without having to learn a whole new language. After all, expect-lite is about making it easy.

Regex Whats

Regex is made up of two parts: what to look for, and whether that "thing" repeats. The whats are characters which describe a digit, a letter or a non-printing character (such as tab). The four terms you will want to know start with a backslash and are followed by a single letter:
\d  is a digit (0 to 9)
\w  is a letter (strictly a "word" character: a letter, a digit or an underscore)
\n  is a new line (the line break at the end of a line, think of it as pressing Enter)
\t  is a tab
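
If you want to try the four whats outside of expect-lite, here is a small sketch using Python's re module. The sample text is made up for illustration, and Python is only standing in as a convenient regex tester; it is not how expect-lite itself is implemented.

```python
import re

# Made-up sample: a word, a tab, a number, a new line, then another word.
sample = "port\t8080\nready"

digit = re.search(r"\d", sample).group()      # first digit in the text: "8"
word_char = re.search(r"\w", sample).group()  # first letter, digit or underscore: "p"
has_tab = re.search(r"\t", sample) is not None      # True, there is a tab after "port"
has_newline = re.search(r"\n", sample) is not None  # True, there is a new line after "8080"

print(digit, word_char, has_tab, has_newline)
```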

Regex Repeats

Repeats are useful for finding a string of digits, for example 123456. A repeat always applies to the term just before it. The two terms for repeats are:
*  the preceding term repeats 0 or more times
+  the preceding term repeats 1 or more times
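
The difference between the two repeats is easiest to see side by side. The following is a minimal Python sketch with a made-up sample string; the variable names are only for illustration.

```python
import re

text = "id=123456 name=abc"  # made-up sample

# + needs at least one digit, so it finds the run of digits.
one_or_more = re.search(r"\d+", text).group()          # "123456"

# * is happy with zero digits, so on its own it matches nothing at the very start.
zero_or_more = re.search(r"\d*", text).group()         # ""

# After a literal prefix, * makes the digits optional.
with_digits = re.search(r"id=\d*", text).group()       # "id=123456"
without_digits = re.search(r"name=\d*", text).group()  # "name="

print(repr(one_or_more), repr(zero_or_more), repr(with_digits), repr(without_digits))
```

In practice the plus is usually the one you want when you expect something to be there.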

Regex matching with expect-lite

With these six terms you can create very useful regular expressions. The following example shows the output of the route command:
$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.1.1.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 eth0
0.0.0.0         10.1.1.1        0.0.0.0         UG    100    0        0 eth0

You could use the following to match the interface:
>route -n
<eth\d

You could also match the first IP address:
>route -n
<\d+.\d+.\d+.\d+

Using the plus, +, means regex will match any digit repeating 1 or more times. An IP address can have 1 to 3 digits per octet (the number between the dots). The repeat makes it easy to match a variable number of digits.
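
To check these two patterns away from a live machine, the route output above can be pasted into a string and searched with Python's re module. This is only a sketch for testing the regex; expect-lite does its own matching against the live command output.

```python
import re

# The route -n output from the example above, pasted in as text.
route_output = """Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.1.1.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 eth0
0.0.0.0         10.1.1.1        0.0.0.0         UG    100    0        0 eth0
"""

# Same pattern as <eth\d : the letters "eth" followed by one digit.
iface = re.search(r"eth\d", route_output).group()

# Same pattern as <\d+.\d+.\d+.\d+ : the first IP-shaped text in the output.
first_ip = re.search(r"\d+.\d+.\d+.\d+", route_output).group()

print(iface, first_ip)
```

The search stops at the first match, which is why it finds 10.1.1.0 on the first route line.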

Or even more useful would be to use a dynamic variable to grab the default gateway:
>route -n
<\n0.0.0.0
+$default_gw=(\d+.\d+.\d+.\d+)
 
By using the new line, the expect line will only match the 0.0.0.0 at the beginning of a line. I'll write more later about how to leverage expect-lite's capture buffer. In this example, the <\n0.0.0.0 line positions expect-lite so that the dynamic variable captures the very next thing that matches the regex \d+.\d+.\d+.\d+, and the value of $default_gw would be 10.1.1.1.
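
The two-step idea (position first, then capture) can be roughly imitated in Python. This is an approximation that assumes the capture starts looking right after the text matched by the previous expect line, as described above; it is not expect-lite's actual code, and the variable names are made up.

```python
import re

route_output = """Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.1.1.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 eth0
0.0.0.0         10.1.1.1        0.0.0.0         UG    100    0        0 eth0
"""

# Step 1, like <\n0.0.0.0 : find the line that begins with 0.0.0.0.
anchor = re.search(r"\n0.0.0.0", route_output)

# Step 2, like +$default_gw=(\d+.\d+.\d+.\d+) : capture the next
# IP-shaped text, looking only at what comes after the anchor.
rest = route_output[anchor.end():]
default_gw = re.search(r"(\d+.\d+.\d+.\d+)", rest).group(1)

# Without the anchor, the same capture would grab the first IP in the output.
unanchored = re.search(r"(\d+.\d+.\d+.\d+)", route_output).group(1)

print(default_gw, unanchored)
```

The anchored capture gives 10.1.1.1, while the unanchored one gives 10.1.1.0, which shows why the <\n0.0.0.0 line is there.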

The regex OR

The seventh regex term that is good to know is OR, which is the vertical line or pipe, |
The pipe allows you to make a statement such as match this OR that. In expect-lite it would look like:
>echo "this"
<this|that

The above example is a bit contrived, but it is common to find the output of a command which might be true OR false, enabled OR disabled, or UP OR DOWN. This may be less useful for a simple match, but it is very useful when capturing a dynamic variable. The following command shows the interface state on a Linux machine:
$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:30:65:96:b5:4a brd ff:ff:ff:ff:ff:ff


To capture the state of the interface (UP or DOWN) in expect-lite, it would be as simple as:
>ip link show eth0
+$intf_state=(UP|DOWN)

By using a simple, yet powerful regex, it is easy to capture the states of the interface in the example above.
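
Here is the same OR capture tried in Python against the ip link output above. One thing worth noticing: the first UP in that output is in the flags list between the angle brackets, not after the word state. The second pattern, which puts the literal word state in front of the OR, is my own variation for this sketch and is not part of the expect-lite example.

```python
import re

# The ip link show eth0 output from the example above.
ip_link_output = """2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:30:65:96:b5:4a brd ff:ff:ff:ff:ff:ff
"""

# Same pattern as +$intf_state=(UP|DOWN) : capture whichever word is found first.
first = re.search(r"(UP|DOWN)", ip_link_output)
intf_state = first.group(1)

# True here: the first UP sits in the flags list, before the word "state".
matched_in_flags = first.start() < ip_link_output.index("state")

# Variation (an assumption of this sketch): anchor on the word "state" first.
state_only = re.search(r"state (UP|DOWN)", ip_link_output).group(1)

print(intf_state, matched_in_flags, state_only)
```

Both patterns give UP for this output. Anchoring on state only matters if the flags and the state could disagree.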

The power of 7

I have only scratched the surface of Regex here, but it should cover 90+ percent of what you might need. With the 7 simple regex terms here (the whats, the repeats, and OR), it is possible to match just about anything you need in expect-lite.



PS. The above is not entirely correct, as the dot is also a regex term (it matches any single character), but the above examples will work without having to know this eighth term.
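
For the curious, here is a small Python sketch of what the dot does. The strict pattern with a backslash in front of each dot and the not_an_ip string are illustrations I made up; the examples in this post do not need them.

```python
import re

loose = r"\d+.\d+.\d+.\d+"       # the pattern used in this post, dot means any character
strict = r"\d+\.\d+\.\d+\.\d+"   # backslash-dot means a real dot only

real_ip = "10.1.1.1"
not_an_ip = "10x1y1z1"           # made-up text with letters where the dots would be

loose_on_ip = re.search(loose, real_ip) is not None       # True
strict_on_ip = re.search(strict, real_ip) is not None     # True
loose_on_junk = re.search(loose, not_an_ip) is not None   # True, the dot accepts x, y and z
strict_on_junk = re.search(strict, not_an_ip) is not None # False, no real dots to match

print(loose_on_ip, strict_on_ip, loose_on_junk, strict_on_junk)
```

Since command output like route -n only has real dots between the digits, the simpler pattern does the job.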