Regular Expression Hanging Question

General questions about using TextPad

Moderators: AmigoJack, bbadmin, helios, Bob Hansen, MudGuard

ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Reread my first post in this thread.

The regex does match as much as it can. Having matched an entire line and replaced it with something, the cursor is at the end of the line. The question is: what should it match next? The answer is that, in the absence of special handling, the next thing it matches is the empty string between the cursor and the line terminator.

The reason for this is the "leftmost match" rule: when matching within a string always choose the leftmost (first) match. This rule takes precedence over the "longest match" rule (greediness): the leftmost match will be found even if there's a longer match that is not leftmost.

Suppose you are replacing a single match of .* with blah.

If the cursor is on an empty line, running the replacement will replace the leftmost match, that is, the empty string between the cursor and the line terminator, with blah. If that's not what you expect, you're using the wrong regex.

In the same way, if the cursor is at the end of a non-empty line, running the replacement will replace the leftmost match, that is, the empty string between the cursor and the line terminator, with blah. Again, if that's not what you expect, you're using the wrong regex.

If you run the replacement repeatedly, a single replacement at a time, then after each replacement, the cursor is at the end of the line. The leftmost match at this position is the empty string, which is replaced with blah. After the replacement, the cursor is still at the end of the line, and so the next leftmost match is again the empty string. Each single replacement therefore appends another blah.

TextPad handles a single "replace all" action in the same way as multiple "replace one" actions. This has the unfortunate consequence that replacing all matches of .* with blah will repeatedly append blahs until the process is interrupted, or it hangs, or the cows come home.

That is the answer to your question.

The question of how the endless loop can be avoided is a different one. When performing a "replace all" action TextPad might check after each single replacement whether the next match would be the empty string, and, if so, advance the cursor by one character. However, any such special handling is a case of doing something different from what the user requested. Whatever solution might be implemented, there will always be cases where the user wants a different solution.

Keep in mind that when the editor loops endlessly in such cases it is doing what the user told it to do, albeit that it is perhaps not what the user wanted it to do.
User avatar
Drxenos
Posts: 209
Joined: Mon Jul 07, 2003 8:38 pm

Post by Drxenos »

I've never heard of the left-most rule, only greedy or non-greedy. I still don't see why a regex engine should treat the end of the line (or the nonexistent "empty" end-of-line between what it just matched and the EOLN character(s)) as special. Nor would I consider an regex which doesn't do that as broken.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

The empty string is nonexistent exactly to the extent that the number 0 is nonexistent: that is, not at all nonexistent.

The regular expression .* matches the empty string, by the very definition of the * operator. When finding matches of .* the recogniser does not treat the empty string as special. That is why it finds it. What is being requested is for the recogniser to treat the empty string as special so that sometimes it does not find it.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Suppose you are replacing all matches of a with x in the text
aaaa
and that the cursor is at the beginning of the text.

Starting at the initial cursor position, the recogniser finds the first match, that is, the first a, and replaces it with x, leaving the cursor at the end of the replacement, at the second a. Starting at the new cursor position, the recogniser finds the first match, that is, the second a, and replaces it with x, leaving the cursor at the end of the replacement, at the third a. And so on.

It is the same with .* . Suppose you are replacing all matches of .* with x in the text
aaaa
and that the cursor is at the beginning of the text.

Starting at the initial cursor position, the recogniser finds the first match, that is, all of the text, and replaces it with x, leaving the cursor at the end of the replacement, at the end of the text. Starting at the new cursor position, the recogniser finds the first match, that is, the empty string, and replaces it with x, leaving the cursor at the end of the replacement, still at the and of the text. And so on.

Notice that the behaviour is the same in the two cases; there is no special treatment. If special treatment were introduced, so that certain empty matches are not matched, the looping behaviour caused by the injudicious use of .* might be prevented.
User avatar
Drxenos
Posts: 209
Joined: Mon Jul 07, 2003 8:38 pm

Post by Drxenos »

I'm not trying to be obtuse. I'm just trying to understand.
User avatar
Drxenos
Posts: 209
Joined: Mon Jul 07, 2003 8:38 pm

Post by Drxenos »

I think you misunderstanding what I am asking. I understand how regular expressions work. I've had many college courses in regular languages, finite automata, etc.

What I don't understand is why the * operator seems to be stuck where it is. It has already done a match for the line. Why does it come back to match more?

Given the line "aaaa\n", if I match ".*", it matches "aaaa" (assuming a greedy match). Why does it then sit and spin after the last 'a'? Is it because it can't get passed the EOLN? Why is it not simply "done" that this point?
vr8ce
Posts: 25
Joined: Thu Dec 04, 2003 6:54 pm

Post by vr8ce »

There is a consensus, however the word is defined (nothing like a pedant to liven up a forum). The consensus is that TP shouldn't be doing what it is doing, which is to lock up.

As already stated, no other editor does that. What choices they make, and what choice TP would make, are not the issue. Pick something. Pick anything but locking up.
Post Reply