Jul 16 2012
Nested strings with regular expressions (similar to recursive regex) in .NET
Parsing nested groups is a really common task required when parsing code-like strings.
Imagine you have to extract parameters from a function call like this one:
ProcedureName(param1,param2, param3)
which is the right regex to do the dirty work?
At a really first glance, in order to capture "param1, param2, param3", a regular expression like this could work:
(?\( [^()]* \) )
but this doesn't work in the case of nested parenthesis, like in this case:
ProcedureName(param1,subprocedure(param2,param3,param4), param3)
It is really known that regular expression engine could not manage nesting groups (exceptions exit in perl), in order to avoid to write a full syntax parser in order to match a fuction call with nested calls, .NET helps with balancing groups.
The Right Solution for .NET
In order to correctly manage this case balancing groups available in .NET regex engine are really useful.
Balancing groups act as a counter of groups with pop and push from the stack. The regex is valid if, and only if, the counter at the end of the matching returns to 0, actually balancing pushes to and pops from the stack.
Balancing groups syntax is quite simple:
(?) : push counter +1
(?) : pop counter -1
(?(groupname)(?!)) : check if the counter is 0
Using these features, counting nested parenthesis is simple:
\( (?(?> \( (? ) | \) (? ) | [^()]+ )* (?(NestedParenthesis)(?!)) ) \)
The key elements in this regular expression are:
- leading and closing literal parenthesis \( and \)
- count up and count down the nested parenthesis in order to catch matching ( and )
- matching of non-parenthesis charters with [^()]
How to manage literal strings
What if parameters can be also strings?
This string would make the regular expression failing in matching nested parameters:
ProcedureName(param1,subprocedure('Bad work : -(',param3,param4), param3)
the : -( emoticon would be matched as a nested parenthesis and all the regular expression would fail.
This case can ben matched with this modified code:
\( (?(?> \( (? ) | \) (? ) | [^()']+ | ('[^']*')* )* (?(NestedParenthesis)(?!)) ) \)
The differences from the previous regex are:
- Exclusion of apex (') from the general characters matching group
- Addition of a specific group to match literal strings:
('[^']*')*
Note that this code works well if the literal string from apex is the double apex (pascal-like).
Comments Off