When I learned of regular expression engines that support recursion I thought I could write a recursive-descent parser in regex. Since I've written JSON parsers a few times and it's a simple spec, I chose that as the test case. In the end I created two versions.
jsonre_v1 = regex.compile(
r'''
(?: (?P<object> \{\s* (?&members)? \s*\})
(?P<members> (?&member) ( \s*,\s* (?&member) )*)
(?P<member> (?&string) \s*:\s* (?&value))
(?P<array> \[\s* ((?&value) (\s*,\s* (?&value))*)? \s*\])
(?P<string> " [^"\\]* (?: \\. | [^"\\]* )* ")
(?P<number> (?&integer) (?&fraction)? (?&exponent)?)
(?P<integer> -? (?: 0 | [1-9][0-9]* ))
(?P<fraction> \. [0-9]*)
(?P<exponent> [eE] [-+]? [0-9]+)
){0}
(?P<value> (?&object) | (?&array) | (?&string) | (?&number) |
true | false | null )
''',
flags=regex.VERBOSE|regex.UNICODE)
The first version looks more like a PEG grammar, which I accomplished by putting all the "rules", except the starting one, in a group that repeats exactly 0 times, meaning its presence in the expression does not match something in the input but only establishes the recursive patterns.
jsonre_v2 = regex.compile(
r'''
(?P<value>
(?P<object> \{\s*
(?: (?P<member>(?&string) \s*:\s* (?&value))
( \s*,\s* (?&member) )* )?
\s*\})
| (?P<array> \[\s* ((?&value) (\s*,\s* (?&value))*)? \s*\])
| (?P<string> " [^"\\]* (?: \\. | [^"\\]* )* ")
| (?P<number> (?P<integer> -? (?: 0 | [1-9][0-9]* ))
(?: \. [0-9]* )?
(?: [eE] [-+]? [0-9]+ )?)
| true
| false
| null
)
''',
flags=regex.VERBOSE|regex.UNICODE)
The second version is defined more naturally as a regex, but it's a little harder to read as a grammar. It turns out this version is also more performant.
Below is the output of the program where I compare it to the built-in JSON parser written in C. First is a basic test for different kinds of JSON values:
======= FIRST TEST =======
{"key": [{"bool": true, "number": -3e-05}, {"string": "hello world", "unicode": "\u4e16\u754c\uff0c\u4f60\u597d"}]}
jsonre_v1 matches
jsonre_v1: 100x took 0.007197466999059543s
jsonre_v2 matches
jsonre_v2: 100x took 0.0032083129990496673s
json module: 100x took 0.0005846970016136765s
The v2 pattern is about twice as fast as v1, but both are an order of magnitude slower than the C parser (which isn't all that bad, really).
The second test has a very recursive structure: an array of arrays 100 levels deep:
======= SECOND TEST =======
[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
jsonre_v1 matches
jsonre_v1: 100x took 0.05688649399962742s
jsonre_v2 matches
jsonre_v2: 100x took 0.01881064599729143s
json module: 100x took 0.0007663409996894188s
The v1 pattern is now about 3.5x slower than the v2 pattern, but both are more than two orders of magnitude slower than the C parser. Clearly recursion hurts the regular expression engine's performance more than non-recursive patterns.
The third test is to see the performance with a parse failure. I test this by prefixing the previous pattern with a object that is not closed at the end. I expect (and observe) that the regular expression parser, with it's backtracking, performs very poorly:
======= THIRD TEST =======
{"key":[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
jsonre_v1 does not match
jsonre_v1: 100x took 1.6580609960001311s
jsonre_v2 does not match
jsonre_v2: 100x took 0.13697527600015746s
json module: 100x took 0.0015753240004414693s
The v1 parser is now about 30x slower than before, 12x slower than v2 for this test, and three orders of magnitude slower than the C parser for this test; the v2 parser is a little more than 7x slower than before and is still two orders of magnitude slower than the C parser. The 2x slowdown for the C parser is probably due to the exception handling I do rather than a slowdown of the parser itself.
After creating this I found that someone else had done it on StackOverflow: https://stackoverflow.com/a/3845829/1441112
And of course there's this (in)famous post about parsing HTML: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Wikipedia has a nice table showing various regular expression engines that support recursion: https://en.wikipedia.org/wiki/Comparison_of_regular-expression_engines