Regular Expression Hanging Question

NoMoreFood
Posts: 5
Joined: Wed Apr 03, 2013 12:29 am

Post by NoMoreFood »

I believe since the new regular expression engine change, I've seen some behavior that has held me back from upgrading my users to any newer TextPad version.

In short, doing something seemingly innocent like typing a few lines:

one
two
three

and then doing a Replace All of '.*' with 'blah'. Now, I'd expect one of two things to happen here: either all three lines would be replaced with a single line 'blah', or there would be three lines of 'blah'. Most editors (vi, Notepad++, and older versions of TextPad, I believe) will do the latter.

Instead of doing that, the newer versions of TextPad just sit there and hang until some internal buffer hits the maximum length of a single line, and then the Replace All is cancelled. My guess is that this is due to some sort of newline character being re-matched over and over again.

Changing the search to '^.*' produces the desired behavior, but it's very frustrating when you use multiple editors and only TextPad hangs for this type of operation.

Is there any way to change this behavior?
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

The regex ".*" matches any number of (non-newline) characters, including none of them. That is, it matches the empty string. A regex that matches the empty string matches everywhere.

In your case, starting at the beginning of the document, TextPad searches for the first match of ".*". It finds one, namely the whole of the first line, and replaces it with "blah". Now, starting immediately after that replacement, TextPad searches for the first match of the regex ".*". It finds one, namely the empty string, right where it is, in front of the newline, and replaces what is matched, the empty string, with "blah". And so on ad infinitum.
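
To make the loop concrete, here is a minimal Python sketch of a Replace All that resumes immediately after each replacement and never forces the cursor forward. It is only an assumption about the general shape of such a loop, not TextPad's actual code; `naive_replace_all` and its `max_steps` cap are hypothetical names, and the cap exists only so that the demonstration terminates.

```python
import re


def naive_replace_all(text, pattern, replacement, max_steps=10):
    """Replace-all that resumes right after each replacement.

    Hypothetical sketch: it never advances past an empty match, so a
    pattern that can match the empty string keeps matching at the same
    spot. max_steps is an artificial cap so the demo stops.
    """
    regex = re.compile(pattern)
    pos = 0
    steps = 0
    while steps < max_steps:
        match = regex.search(text, pos)
        if match is None:
            break
        # Splice the replacement in place of whatever was matched.
        text = text[:match.start()] + replacement + text[match.end():]
        # Resume immediately after the inserted replacement.
        pos = match.start() + len(replacement)
        steps += 1
    return text, steps


result, steps = naive_replace_all("one\ntwo\nthree\n", ".*", "blah", max_steps=5)
print(repr(result), steps)
```

With the cap set to 5, the first line has already grown to five copies of "blah" while "two" and "three" are untouched; without the cap the first line would keep growing.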

You are right that some editors do not behave as TextPad does; they treat replacement of the empty string as a special case and move forward one character after each such replacement. Whether you think this is a good thing depends on your point of view. You may argue that although these other editors are not doing what the user requested, they might be doing what the user intended. Alternatively, you may argue that this behaviour is not consistent with the behaviour of replacements of non-empty strings. The behaviour of TextPad 7 is not unreasonable; the most you can say is that it's undesirable.
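
A minimal Python sketch of that special case, assuming the simplest possible policy: after replacing an empty match, copy one character through unchanged and resume after it. `replace_all_advancing` is a hypothetical name, and real editors may differ in the details.

```python
import re


def replace_all_advancing(text, pattern, replacement):
    """Replace-all that steps one character forward after an empty match.

    Hypothetical sketch of the 'special case' policy; not taken from
    any particular editor.
    """
    regex = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(text):
        match = regex.search(text, pos)
        if match is None:
            break
        # Keep the unmatched text before the match, then the replacement.
        out.append(text[pos:match.start()])
        out.append(replacement)
        if match.start() == match.end():
            # Empty match: copy one character through and move past it,
            # so the same position is never matched twice.
            out.append(text[match.start():match.start() + 1])
            pos = match.start() + 1
        else:
            pos = match.end()
    # Whatever is left after the last match is copied unchanged.
    out.append(text[pos:])
    return "".join(out)


print(repr(replace_all_advancing("one\ntwo\nthree\n", ".*", "blah")))
# 'blahblah\nblahblah\nblahblah\nblah'
```

Note that the loop now terminates, but each line still becomes "blahblah", because the empty string in front of each newline is still a legitimate match.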

The moral (and the solution to your problem) is: don't use ".*" unless you really mean it. Use ".+" instead.
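
A quick check in Python, whose `re` module is used here only as a stand-in for TextPad's engine, illustrates why ".+" sidesteps the problem: it cannot match the empty string, so nothing is left to match in front of each newline.

```python
import re

text = "one\ntwo\nthree\n"

# ".+" needs at least one character, so it matches each line's content once
# and never matches the empty string before a newline.
fixed = re.sub(r".+", "blah", text)
print(repr(fixed))  # 'blah\nblah\nblah\n'
```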
NoMoreFood
Posts: 5
Joined: Wed Apr 03, 2013 12:29 am

Post by NoMoreFood »

Thanks for the great explanation -- I concur 100% with your breakdown and logic.

I guess I'd still propose that the TextPad developers consider the "replacement of the empty string as a special case and move forward one character" approach, or however other editors/libraries do it. It may not be the principled or moral thing to do, but any well-programmed application ought to protect against crashes and hangs. Besides, pretty much every other regular expression parser seems to skip matching empty strings, including the Perl and .NET base libraries.

Anyhow, just my opinion. Thanks again for the response - I do appreciate it.
vr8ce
Posts: 25
Joined: Thu Dec 04, 2003 6:54 pm

Post by vr8ce »

I don't really think it depends on anyone's point of view. It's a bad user experience. It's always been a bad user experience. It's on the small list of things I hate about TextPad. (Along with the change a few versions ago to suddenly force parentheses to be escaped in replacement strings.)

I'll be glad to wait here while someone shows other examples of where this usage works this way. I've never found one, and I have a dozen editors and programming environments.

It does not benefit anyone. It is not expected behavior. It should have never been implemented that way.
Drxenos
Posts: 209
Joined: Mon Jul 07, 2003 8:38 pm

Post by Drxenos »

Although it's technically correct behavior for .* to match nothing when it could match more, I think it should exhibit greedy matching unless told otherwise by the user. That is consistent with pretty much every regex engine out there.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

vr8ce wrote:I don't really think it depends on anyone's point of view.
That TextPad gets itself into an endless loop isn't good. But if you want the tool to do something different from what the user requested, you have given it the problem of guessing what the user intended.

Perhaps the guess doesn't depend on anyone's point of view, but it does depend on the tool you're using. Different tools make different guesses.

With Perl (v 5.12.4) the script

Code: Select all

my $s = "one\n"   .
        "two\n"   .
        "three\n" ;

$s =~ s/.*/blah/g ;

print "$s\n" ;
produces

Code: Select all

blahblah
blahblah
blahblah
With UltraEdit (v 21.20), given the text

Code: Select all

one
two
three
if the text has Unix line endings the replacement produces

Code: Select all

blahblah
blahblah
blahblah
blah
and if the text has Windows line endings it produces

Code: Select all

blahblah
blah
blahblah
blah
blahblah
blah
blah
with some solitary carriage returns.

None of these is what the original poster wants.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Drxenos wrote:Although it's technically correct behavior for .* to match nothing when it could match more, I think it should exhibit greedy matching
The star operator is greedy and it doesn't match nothing when it could match more. In TextPad dot matches any character other than newline. So when the cursor is at the end of a line the only match at that position is the empty string. That is unrelated to greediness.

The issue is that when it has succeeded in matching an empty string (or is about to match one) it doesn't advance the cursor.
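
This can be checked with a short Python sketch (Python's `re` module standing in for TextPad's engine, on the assumption that dot excludes newline in both). The first match is greedy and takes the whole line; the second attempt starts in front of the newline, where the empty string is the only thing ".*" can match.

```python
import re

regex = re.compile(r".*")
text = "one\ntwo\nthree\n"

# First attempt at the start of the document: greedy, takes all of "one".
first = regex.match(text, 0)
# Second attempt right where the first match ended, in front of "\n".
second = regex.match(text, first.end())

print(repr(first.group()), first.span())    # 'one' (0, 3)
print(repr(second.group()), second.span())  # '' (3, 3)
```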
Drxenos
Posts: 209
Joined: Mon Jul 07, 2003 8:38 pm

Post by Drxenos »

I think I'm still confused as to what the issue is. Why doesn't it match "one", "two", or "three"?
NoMoreFood
Posts: 5
Joined: Wed Apr 03, 2013 12:29 am

Post by NoMoreFood »

I think the only consensus is that it probably ought to not match empty sets. Would love to see a change in this regard to 7.x and 8.x.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Drxenos wrote:I think I'm still confused as to what the issue is. Why doesn't it match "one", "two", or "three"?
I don't understand your question. The regex .* does match one, two and three.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

NoMoreFood wrote:I think the only consensus is that it probably ought to not match empty sets. Would love to see a change in this regard to 7.x and 8.x.
There is no such consensus. The regex .* must match the empty string. That is what * means.

If you don't want to match the empty string use .+ .
NoMoreFood
Posts: 5
Joined: Wed Apr 03, 2013 12:29 am

Post by NoMoreFood »

Consensus doesn't mean every last person.

I agree with your logic, but one has to realize that EVERY other text editor handles this as a special case, since it's just not a good idea to put the user in a position where the program is going to hang, potentially losing work.

Here's another innocent case. Let's say I want to put 'begin-' and '-end' around every line:

one
two

three

So you do a search for (.*) and replace it with begin-\1-end. In Notepad++, this does the following:

begin-one-end
begin-two-end
begin--end
begin-three-end

In TextPad... hang. Is this really the ideal behavior? Yeah, I could search for ^(.*)$, but this is normally implied behavior.
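
For comparison, a minimal Python sketch of the anchored version, assuming a multi-line mode in which ^ and $ match at line boundaries (`re.MULTILINE` in Python; TextPad's own anchoring rules may differ). The sample text is written without a trailing newline to keep the result unambiguous.

```python
import re

text = "one\ntwo\n\nthree"

# Anchoring both ends leaves exactly one match per line, including the
# empty line, so no stray empty match remains at the end of a line.
wrapped = re.sub(r"^(.*)$", r"begin-\1-end", text, flags=re.MULTILINE)
print(wrapped)
```

In Python this prints the same four lines as the Notepad++ result above.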

Anyhow, this will be my last post on this topic. I appreciate all the feedback others have given. I truly believe addressing this behavior would make TextPad a more robust editor. Thanks.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

I used the word consensus in its original, stronger, meaning:

"Agreement in opinion; the collective unanimous opinion of a number of persons." [OED]

My comment was correct, even with the weaker sense of consensus. It was in reference to

"I think the only consensus is that it probably ought to not match empty sets."

There certainly is no such consensus, at least amongst those who understand regular expressions.

There is, I suspect, a consensus that TextPad shouldn't get into an endless loop. Other editors avoid this, but, as I showed above, there is no consensus on what they should do when the user enters something different from what they mean. Some other editors do strange things in these cases.

If you mean
^.*
why not write
^.*
?
NoMoreFood
Posts: 5
Joined: Wed Apr 03, 2013 12:29 am

Post by NoMoreFood »

To answer your question, this is not so much about me as my users. I have no problem adjusting my behavior slightly although I find myself using Notepad++ more than TextPad anyhow (mainly due to it being installed on all our 12,000 systems).

We have over 400 TextPad licenses and we're all still using the 5.x series, even though we bought additional licenses when 7.x came out. I suspect that if I roll this out, I will have some users complaining about the change in behavior. We have a large community that edits files of hundreds of megabytes, and a hang can mean lost work / money.
Drxenos
Posts: 209
Joined: Mon Jul 07, 2003 8:38 pm

Post by Drxenos »

ben_josephs wrote:
Drxenos wrote:I think I'm still confused as to what the issue is. Why doesn't it match "one", "two", or "three"?
I don't understand your question. The regex .* does match one, two and three.
Sorry if I wasn't clear. I meant, why doesn't it match one of the above and be done with it? Why does it go on to then match an empty string at the end of the line? Seeing as * is greedy, I would assume it would be matching as much as it can on a line.