Regex bug with lookahead AND lookbehind


jeffy
Posts: 323
Joined: Mon Mar 03, 2003 9:04 am
Location: Philadelphia

Post by jeffy »

I'm pretty sure I found a regex related bug here. Could someone please confirm? I'm using 7.0.9.

Code: Select all

text	text	text
Put the cursor somewhere in the middle word and search for '$' down (regex, without the quotes). The cursor goes to the end of the line. Then search for '(?<=\S)\t(?=\S)' down. It wraps around and finds the first tab (a tab between two non-whitespace characters).

Now do it again, but this time search for '(?<=\S)\t(?=\S)' up. It doesn't work (says not found).

However, if you put the cursor at the start of the line (actually, I think anywhere before the tab itself) and then search for the same thing, it works.

Thanks for checking!
ben_josephs
Posts: 2456
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

You're right. In fact it appears that all look-behind expressions fail when searching backwards. I suspect that fixing this would be a non-trivial job.

Negative look-behind ( (?<!...) ) appears not to work at all. It always matches, so the regex as a whole behaves as if the look-behind assertion wasn't there.
jeffy
Posts: 323
Joined: Mon Mar 03, 2003 9:04 am
Location: Philadelphia

Post by jeffy »

I'm glad I'm not the only one. Thanks.

I wonder why it's a non-trivial change, although I expect the explanation is non-trivial :)
jeffy
Posts: 323
Joined: Mon Mar 03, 2003 9:04 am
Location: Philadelphia

Post by jeffy »

Another example of this problem I just encountered:

Code: Select all

(?<=[ \t])\bIIMeta\b\s*\(\s*\b(\w+)(|(?:<[?\w ]+>)|(?:<[^<]*<[?\w ]+>[^>]*>)|(?:<[^<]*<[^<]*<[?\w ]+>[^>]*>[^>]*>))\s+(\w+)\b(?!>)\s*\)
This finds a Java function signature with exactly 1 parameter (including up to three levels of generics after the type). It should find either of these:

Code: Select all

	public IIMeta(IIMeta ii_toCopy)  {
	public IIMeta(String s_instanceName)  {
specifically selecting only "IIMeta(...)", but it doesn't work.

However, removing the back-reference (despite selecting the initial whitespace character, which is what I'm trying to avoid) does work:

Code: Select all

[ \t]\bIIMeta\b\s*\(\s*\b(\w+)(|(?:<[?\w ]+>)|(?:<[^<]*<[?\w ]+>[^>]*>)|(?:<[^<]*<[^<]*<[?\w ]+>[^>]*>[^>]*>))\s+(\w+)\b(?!>)\s*\)
Also, the backreference DOES work if you first click the very top of the document, and then search down, but it only finds the first instance, and then gets stuck again.
ben_josephs
Posts: 2456
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

If you prefix your regex with the modifier (?x) you can add white space without changing the meaning, making the regex (perhaps) somewhat easier to read:

Code: Select all

(?x) (?<=[ \t]) \b IIMeta \b \s* \( \s* \b (\w+) ( | (?:<[?\w ]+>) | (?:<[^<]*<[?\w ]+>[^>]*>) | (?:<[^<]*<[^<]*<[?\w ]+>[^>]*>[^>]*>) ) \s+ (\w+) \b (?!>) \s* \)
For the sake of simplicity, remove the Java template stuff, which isn't required in your example:

Code: Select all

(?x) (?<=[ \t]) \b IIMeta \b \s* \( \s* \b (\w+) \s+ (\w+) \b (?!>) \s* \)
It is now apparent that the word boundary anchors (\b) are redundant, and can be removed:

Code: Select all

(?x) (?<=[ \t]) IIMeta \s* \( \s* (\w+) \s+ (\w+) (?!>) \s* \)
As can the (?!>) look-ahead:

Code: Select all

(?x) (?<=[ \t]) IIMeta \s* \( \s* (\w+) \s+ (\w+) \s* \)
And some parentheses:

Code: Select all

(?x) (?<=[ \t]) IIMeta \s* \( \s* \w+ \s+ \w+ \s* \)
It is now clear that your regex matches the function name and its single parenthesised typed parameter.

I don't see anything wrong.

(By the way, the expression (?<=[ \t]) is a look-behind assertion, not a back-reference.)
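
As a rough cross-check outside TextPad, the sketch below runs the original expression and the final simplified one against the two sample lines. It uses Python's re module purely as an assumption for illustration (TextPad uses a different engine, so this says nothing about TextPad's own behaviour), and the names ORIGINAL, SIMPLIFIED, SAMPLES and matched_text are made up for the example.

```python
import re

# The original expression from the earlier post, split over several
# string literals only to keep the lines short.
ORIGINAL = re.compile(
    r"(?<=[ \t])\bIIMeta\b\s*\(\s*\b(\w+)"
    r"(|(?:<[?\w ]+>)|(?:<[^<]*<[?\w ]+>[^>]*>)"
    r"|(?:<[^<]*<[^<]*<[?\w ]+>[^>]*>[^>]*>))"
    r"\s+(\w+)\b(?!>)\s*\)"
)

# The final simplified form; (?x) makes the spaces in the pattern insignificant.
SIMPLIFIED = re.compile(r"(?x) (?<=[ \t]) IIMeta \s* \( \s* \w+ \s+ \w+ \s* \)")

SAMPLES = [
    "\tpublic IIMeta(IIMeta ii_toCopy)  {",
    "\tpublic IIMeta(String s_instanceName)  {",
]


def matched_text(rx, line):
    """Return the text selected by the first match, or None if there is none."""
    m = rx.search(line)
    return m.group(0) if m else None


for line in SAMPLES:
    print(matched_text(ORIGINAL, line), "|", matched_text(SIMPLIFIED, line))
```

In this engine both expressions select only the "IIMeta(...)" part of each line, without the preceding space, which fits the reading that the expression itself is fine and the failure lies in how the search is carried out.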
bbadmin
Site Admin
Posts: 782
Joined: Mon Feb 17, 2003 8:54 pm

Post by bbadmin »

The Boost regular expression engine used by TextPad does not support backwards searches. (Its author says it would require a whole new state machine implementation and the dropping of lots of features.) The workaround that TextPad implements is to iterate backwards, a character at a time, and try to match the search pattern forwards from there. This works in most cases, but not with look-behind assertions.
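
The backwards-iteration workaround described above can be sketched roughly as follows. This is a Python model, not TextPad's or Boost's actual code: the function search_backwards and its see_preceding_text switch are hypothetical, and the idea that each forward attempt cannot see the text before its starting point is an assumption made here to illustrate one way the reported symptoms could arise.

```python
import re


def search_backwards(pattern, text, start, see_preceding_text=True):
    """Step back from `start` one character at a time and try an anchored
    forward match at each position. Return (begin, end) of the first hit,
    or None if no position matches."""
    rx = re.compile(pattern)
    for pos in range(start, -1, -1):
        if see_preceding_text:
            # Matching with a start offset: look-behind can still
            # inspect the characters before `pos`.
            m = rx.match(text, pos)
            if m:
                return m.start(), m.end()
        else:
            # Matching on a slice: the engine is handed only text[pos:],
            # so nothing before `pos` exists as far as it can tell.
            m = rx.match(text[pos:])
            if m:
                return pos + m.start(), pos + m.end()
    return None


LINE = "text\ttext\ttext"
TAB_BETWEEN_WORDS = r"(?<=\S)\t(?=\S)"
TAB_NOT_AFTER_T = r"(?<!t)\t"

# Searching up from the end of the line.
print(search_backwards(TAB_BETWEEN_WORDS, LINE, len(LINE), True))
print(search_backwards(TAB_BETWEEN_WORDS, LINE, len(LINE), False))
print(search_backwards(TAB_NOT_AFTER_T, LINE, len(LINE), True))
print(search_backwards(TAB_NOT_AFTER_T, LINE, len(LINE), False))
```

In this model, the positive look-behind finds the second tab when the preceding text is visible and finds nothing when it is not, while the negative look-behind is trivially satisfied at the start of every slice. Both outcomes resemble the symptoms reported earlier in the thread, but that resemblance is only suggestive.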