German Perl Workshop 9 — Advanced Perl 5.10 Regular Expressions

I am currently at the German Perl Workshop 9 in Munich. I'll try and blog a bit about the workshop, so here is my take on the first talk, which was a three hour introduction to advanced regular expressions in Perl 5.10. Yves Orton (demerphq) starts out by reminding people that Perl 5.10 will be coming soon (before christmas — and contrary to the common Perl 6 joke he really seems to be talking about this christmas). Quote: »This should excite you. If it doesn't then you should drink more coffee!«. As a motivation, he also shows this strip from one of my favourite webcomics. Furthermore, he is dressed in Lederhosen because »if you're dressed like this, people have to buy you beer« (don't know how good that worked out last night).

So Yves has been hacking quite a bit on regular expressions for Perl 5.10, most of which is already in the bleeding edge Perl 5.9.5:

Posessive quantifiers

Say you would like to match balanced quoted strings. The naive way would be to use something like this:

my $quote_re = qr/
  "
    (?:
       [^"\\]+ |
       (?: \\.)+ 
    )*
  "
/x;

This works, but the combination of + and * leads to a combinatorial explosion. You can easily see that by trying it out with use re 'debug'; and counting the lines of debug output. If you try to match the above against q("fooooooooo\") (which will not match because the second quote is escaped), you get about a thousand lines of debug output.

In Perl 5.8.x, there already is the possibility to use the (?> pattern) construct (see perldoc perlre) to prevent backtracking. But to use it, we would have to write the following regex:

my $quote_re = qr/
  "
    (?>
      (?:
        (?> [^"\\]+) |
        (?> 
          (?: \\.)+
        )
      )*
    )
  "
/x;

This works and only shows about 120 lines of debug output on the above string. On the other hand, this of course makes our eyes bleed.

This is why in Perl 5.10, there is the possessive quantifier, which reduces the above to the following:

my $quote_re = qr/
  "
    (?:
       [^"\\]++ |
       (?: \\.)++
    )*+
  "
/x;

I guess you can deduce the syntax from the highlighting. Neat, eh?

Named capture buffers

In case you have not yet noticed, numbered captures buffers suck. Yves mentions that they suck especially if you have an embedded regex (for example from user input — although I wonder why you would want to allow that anyways): if you have something like this / (foo) $user_qr (what-number-am-i) /x, you have no possible way to know what number references the (what-number-am-i) capture buffer.

This is why Perl 5.10 introduces named capture buffers with a similar syntax to the one in .NET (because the one used in Python is uglier). They are declared like this: (?pattern) or (?'name'pattern) and you can then backreference them using \k<name> or \k'name'. Multiple buffers may have the same name, if only one of them matches, you get the content of it in the backreference, if more than one matches, you get the leftmost one in \k<name>.

Furthermore, the %+ hash contains the leftmost matches for the given names (so %+{name} gives you the same as \k<name> with in the regex). The %- hash contains all capture buffers referenced by name. This also makes it easy to use exists to see if one of the named capture buffers matched.

Numbered capture buffers suck for another reason: the backreference is ambivalent: \XX can either be a backreference or an octal number. So how does Perl decide whether to treat it as a backreference or an octal number? Actually it uses ugly heuristics: if XX is less than 10, it always is a backreference, if is greater or equal than 10, it is either a backreference or octal, depending on how many capture buffers there are.

You also can not easily concatenate such strings, think '\1' . '1'. There is even no good solution as to how to escape something like this, escaping the 1 as \1 is obviously no solution. This is why Perl 5.10 introduces a new syntax for numbered backreferences, too: g{1} or \g1 (g stands for »group«) is the new \1. This also allows for relative backrefences: \g{-1} or g-1 refers to the previous capture buffer. An application for this is the following dupe word matching regex: /(\w+)\s+\g{-1}/, which can not be expressed that easily in current Perl versions.

Optimizations

Perl 5.10 will also include some new optimizations, in particular Tries and Aho-Corasick for alternations starting with a fixed string, which means that things like /foam|foal|foad/ will then be optimized automatically to /foa[mld]/ as it can be easily deduced that the first three characters are shared after building the Trie.

Perl 5.10 also optimized one-element character classes. Apparently, lots of people have started to escape things using one-element character classes (and I admit to having done so myself, as [␣] IMHO looks much more readable than \␣. But apparently, in Perl 5.8, this is really bad as the character class will be translated into a full ANYOF structure which makes the regex a lot slower. This is specifically the case if you only have a string and not a »real« regular expression, as the string will not even trigger the regular expression engine, while the string with a character escaped using a one-element character class will. Luckily, this will no longer be a problem in Perl 5.10, as it optimizes them away into the character itself.

Backtracking Control Verbs

Backtracking control verbs are a new feature that allow for a lot of control over what the backtracking engine will do. Apparently, they were introduced because they are especially helpful for the Spamassassin guys — which was quite surprising for me, as it basically means introducing a language feature for an application. But hey, looks like an interesting feature, so why not. I'll spare you the exact details (which means that I don't have to understand them right now, too :-), but I guess what will be most interesting for things like Spamassassin is the new (*MARK:name) ((*:name) in short) control verb. This can be used like this: /foo(*:A)|bar(*:B)|baz(*:C)/ to see which part of a regular expression actually matched without grouping everything and then looking into the corresponding group. This apparently saves a lot of RAM if you don't need the content of the group anyways (and this is exactly what things like Spamassassin need, they just need to know what matched, but they usually don't care for the exact content of the match).

/k and \K

The /k modifier is another nice new feature of Perl 5.10. It gives you the variables ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}, which work in exactly the same way as $PREMATCH, $MATCH and $POSTMATCH. The only difference is that you don't pay a global performance penalty if you use it.

Totally unrelated except for sharing the same character is the keep pattern \K. It can be used to tell the regex engine that everything before it should not be included in the match. This is extremely useful in substition because capturing can be avoided in this way (which again saves both memory and time). The basic example is that s/foo+\Kbar/baz/ is the same as s/(fo+)bar/${1}baz/ but the former is a lot faster.

All in all, I really enjoyed the talk, although some of the things were quite hard to grasp. But hey, better than a boring talk about things I already know.

shift or die

security. photography. foobar.