Regular Expression Hanging Question
-
- Posts: 5
- Joined: Wed Apr 03, 2013 12:29 am
I believe since the new regular expression engine change, I've seen some behavior that has held me back from upgrading my users to any newer TextPad version.
In short, doing something seemingly innocent like typing a few lines:
one
two
three
And then doing a replace all of '.*' with 'blah'. Now, I'd expect one of two things to happen here: either all three lines would be replaced with one line 'blah' or there would be three lines of 'blah'. Most editors (vi, Notepad++, and older versions of TextPad, I believe) will do the latter.
Instead of doing that, the newer versions of TextPad just sit there and hang until some internal buffer hits the maximum length of a single line, and then the replace all is cancelled. My guess is that this is due to something at the newline being re-matched over and over again.
Changing the search to '^.*' produces the desired behavior, but it's very frustrating when you use multiple editors and only TextPad hangs for this type of operation.
Is there any way to change this behavior?
-
- Posts: 2461
- Joined: Sun Mar 02, 2003 9:22 pm
The regex ".*" matches any number of (non-newline) characters, including none of them. That is, it matches the empty string. A regex that matches the empty string matches everywhere.
In your case, starting at the beginning of the document, TextPad searches for the first match of ".*". It finds one, namely the whole of the first line, and replaces it with "blah". Now, starting immediately after that replacement, TextPad searches for the first match of the regex ".*". It finds one, namely the empty string, right where it is, in front of the newline, and replaces what is matched, the empty string, with "blah". And so on ad infinitum.
You are right that some editors do not behave as TextPad does; they treat replacement of the empty string as a special case and move forward one character after each such replacement. Whether you think this is a good thing depends on your point of view. You may argue that although these other editors are not doing what the user requested, they might be doing what the user intended. Alternatively, you may argue that this behaviour is not consistent with the behaviour of replacements of non-empty strings. The behaviour of TextPad 7 is not unreasonable; the most you can say is that it's undesirable.
The moral (and the solution to your problem) is: don't use ".*" unless you really mean it. Use ".+" instead.
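The loop described above can be reproduced outside TextPad. What follows is only a sketch in Python, not TextPad's actual code: `naive_replace_all` and its `max_steps` safety cap are hypothetical names invented for this illustration, and the function simply resumes searching exactly where the previous replacement ended, with no special case for empty matches.

```python
import re

def naive_replace_all(pattern, repl, text, max_steps=20):
    """Replace every match, resuming exactly where the last replacement ended.

    Hypothetical model of the behaviour described in the post above.
    Nothing forces the position forward after an empty match, so
    max_steps is a safety cap. Returns (result, hit_cap).
    """
    regex = re.compile(pattern)
    pos = 0
    for _ in range(max_steps):
        m = regex.search(text, pos)
        if m is None:
            return text, False
        # Splice the replacement in, then resume right after it.
        text = text[:m.start()] + repl + text[m.end():]
        pos = m.start() + len(repl)
    return text, True

looping, hit_cap = naive_replace_all(r".*", "blah", "one\ntwo\nthree")
fixed, fixed_hit_cap = naive_replace_all(r".+", "blah", "one\ntwo\nthree")
print(hit_cap, repr(looping[:12]))   # the first line just keeps growing
print(fixed_hit_cap, repr(fixed))
```

With ".*" the first line grows by one "blah" per step until the cap stops it, and "two" and "three" are never reached. With ".+" the same loop finishes on its own, because ".+" cannot match the empty string in front of the newline.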
-
- Posts: 5
- Joined: Wed Apr 03, 2013 12:29 am
Thanks for the great explanation -- I concur 100% with your breakdown and logic.
I guess I'd still propose that the TextPad developers consider the "replacement of the empty string as a special case and move forward one character" approach, or however other editors/libraries do it. It may not be the principled or moral thing to do, but any well-programmed application ought to protect against crashes and hangs. That, and pretty much every other regular expression parser seems to skip matching empty strings, including the Perl and .NET base libraries.
Anyhow, just my opinion. Thanks again for the response - I do appreciate it.
I don't really think it depends on anyone's point of view. It's a bad user experience. It's always been a bad user experience. It's one of the few things I hate about TextPad. (Along with the change a few versions ago to suddenly force parentheses to be escaped in replacement strings.)
I'll be glad to wait here while someone shows other examples of where this usage works this way. I've never found one, and I have a dozen editors and programming environments.
It does not benefit anyone. It is not expected behavior. It should have never been implemented that way.
-
- Posts: 2461
- Joined: Sun Mar 02, 2003 9:22 pm
vr8ce wrote: I don't really think it depends on anyone's point of view.
That TextPad gets itself into an endless loop isn't good. But if you want the tool to do something different from what the user requested, you have given it the problem of guessing what the user intended.
Perhaps the guess doesn't depend on anyone's point of view, but it does depend on the tool you're using. Different tools make different guesses.
With Perl (v 5.12.4) the script
Code:
my $s = "one\n" .
        "two\n" .
        "three\n";
# Replace every match of .* with "blah" (dot does not match newline by default).
$s =~ s/.*/blah/g;
print "$s\n";
Code:
blahblah
blahblah
blahblah
Code:
one
two
three
Code:
blahblah
blahblah
blahblah
blah
Code:
blahblah
blah
blahblah
blah
blahblah
blah
blah
None of these is what the original poster wants.
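As one more data point, here is a small sketch using Python's standard `re` module. It assumes Python 3.7 or later (earlier versions treated empty matches next to a previous match differently) and a three-line input with no trailing newline, which may not be exactly what was tested in the posts above.

```python
import re

text = "one\ntwo\nthree"   # assumed input: no trailing newline

# ".*" also matches the empty string left at the end of each line.
star = re.sub(r".*", "blah", text)

# ".+" needs at least one character, so the empty matches disappear.
plus = re.sub(r".+", "blah", text)

# "^.*" with MULTILINE can only match at the start of a line.
anchored = re.sub(r"^.*", "blah", text, flags=re.MULTILINE)

print(repr(star))      # 'blahblah\nblahblah\nblahblah'
print(repr(plus))      # 'blah\nblah\nblah'
print(repr(anchored))  # 'blah\nblah\nblah'
```

Python does not hang on ".*", but it does not produce one "blah" per line either. Only the patterns that cannot match an empty string after the line's text give that result.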
-
- Posts: 2461
- Joined: Sun Mar 02, 2003 9:22 pm
Drxenos wrote: Although it's technically correct behavior for .* to match nothing when it could match more, I think it should exhibit greedy matching
The star operator is greedy, and it doesn't match nothing when it could match more. In TextPad, dot matches any character other than newline. So when the cursor is at the end of a line, the only match at that position is the empty string. That is unrelated to greediness.
The issue is that when it has succeeded in matching an empty string (or is about to match one) it doesn't advance the cursor.
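For comparison, the "advance the cursor after an empty match" rule could look roughly like this. It is a minimal sketch in Python, not how TextPad or any particular editor implements it, and `replace_all_advancing` is a hypothetical name used only here.

```python
import re

def replace_all_advancing(pattern, repl, text):
    """Replace all matches; after an empty match, step over one character.

    Hypothetical sketch of the special case discussed in this thread.
    """
    regex = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        out.append(text[pos:m.start()])   # unmatched text before the match
        out.append(repl)
        if m.end() == m.start():
            # Empty match: copy the next character through unchanged and
            # move past it, so the same spot is never matched twice.
            out.append(text[m.end():m.end() + 1])
            pos = m.end() + 1
        else:
            pos = m.end()
    out.append(text[pos:])                # whatever is left after the last match
    return "".join(out)

advanced_star = replace_all_advancing(r".*", "blah", "one\ntwo\nthree")
advanced_plus = replace_all_advancing(r".+", "blah", "one\ntwo\nthree")
print(repr(advanced_star))  # 'blahblah\nblahblah\nblahblah'
print(repr(advanced_plus))  # 'blah\nblah\nblah'
```

The loop always terminates, but with ".*" each line still gets a second "blah" from the empty match at its end. Avoiding that takes a further rule, such as ignoring an empty match that directly follows a previous match, and that is another guess about what the user intended.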
-
- Posts: 5
- Joined: Wed Apr 03, 2013 12:29 am
Consensus doesn't mean every last person.
I agree with your logic, but one has to realize that EVERY other text editor handles this as a special case, since it's just not a good idea to put the user in a position where the program is going to hang, potentially losing work.
Here's another innocent case. Let's say I want to put 'begin-' and '-end' around every line:
one
two
three
So you do a search for (.*) and replace it with begin-\1-end. In Notepad++, this does the following:
begin-one-end
begin-two-end
begin--end
begin-three-end
In TextPad... hang. Is this really ideal behavior? Yeah, I could search for ^(.*)$, but that anchoring is normally implied.
Anyhow, this will be my last post on this topic. I appreciate all the feedback others have given. I truly believe addressing this behavior would make TextPad a more robust editor. Thanks.
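The wrapping case above can be written so that the pattern can never match an empty string next to a line it has already matched. A short sketch with Python's `re` module follows; the input, including its blank third line, is an assumption chosen for illustration and is not taken from the post.

```python
import re

text = "one\ntwo\n\nthree"   # assumed input with one blank line

# Anchored at both ends of the line: each line is wrapped once,
# and a blank line is wrapped too.
wrap_all = re.sub(r"^(.*)$", r"begin-\1-end", text, flags=re.MULTILINE)

# ".+" needs at least one character, so the blank line is left alone.
wrap_nonblank = re.sub(r"(.+)", r"begin-\1-end", text)

print(wrap_all)
print(wrap_nonblank)
```

Which of the two is wanted depends on whether blank lines should be wrapped. Both avoid the empty match at the end of a line that a bare (.*) allows.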
-
- Posts: 2461
- Joined: Sun Mar 02, 2003 9:22 pm
I used the word consensus in its original, stronger, meaning:
"Agreement in opinion; the collective unanimous opinion of a number of persons." [OED]
My comment was correct, even with the weaker sense of consensus. It was in reference to
"I think the only consensus is that it probably ought to not match empty sets."
There certainly is no such consensus, at least amongst those who understand regular expressions.
There is, I suspect, a consensus that TextPad shouldn't get into an endless loop. Other editors avoid this, but, as I showed above, there is no consensus on what they should do when the user enters something different from what they mean. Some other editors do strange things in these cases.
If you mean
^.*
why not write
^.*
?
-
- Posts: 5
- Joined: Wed Apr 03, 2013 12:29 am
To answer your question, this is not so much about me as my users. I have no problem adjusting my behavior slightly although I find myself using Notepad++ more than TextPad anyhow (mainly due to it being installed on all our 12,000 systems).
We have over 400 TextPad licenses and we're all still using the 5.x series, even though we bought additional licenses when 7.x came out. I suspect that if I roll this out, I will have some users complaining about the change in behavior. We have a large community that edits files of hundreds of megabytes, and a hang can mean lost work / money.
Drxenos wrote: I think I'm still confused as to what the issue is. Why doesn't it match "one", "two", or "three"?
ben_josephs wrote: I don't understand your question. The regex .* does match one, two and three.
Sorry if I wasn't clear. I meant, why doesn't it match one of the above and be done with it? Why does it then go on to match an empty string at the end of the line? Seeing as * is greedy, I would assume it would match as much as it can on a line.