Some regexps which worked before now causes "eternal lo

Erik G D
Posts: 4
Joined: Tue Jan 29, 2008 9:37 am

Post by Erik G D »

In earlier versions, replacing the whole contents of a line worked like this: replace "\(.*\)" with "some text including \1;".
Now (version 7.0.4) the corresponding regexp, "(.*)", causes TP to hang, eventually producing the error message (which, by the way, cannot be copied and pasted): "Operation interrupted - it would make line 1 longer than (big number)".
I know I could rewrite the pattern to something like "(^..*)" to make it work, but the point is that it worked before, and I cannot see why the pattern should cause a recursive loop, because a match of the whole line should be preferred over a match of the first character.
(also, "eternal loops" should always be avoided.)
MudGuard
Posts: 1295
Joined: Sun Mar 02, 2003 10:15 pm
Location: Munich, Germany

Post by MudGuard »

Erik G D wrote: Now (version 7.0.4) the corresponding regexp, "(.*)", causes TP to hang

Version 7.0.7 has been out for more than a month, and it has several fixes regarding regex.
You should update to the newest version ...
bbadmin
Site Admin
Posts: 877
Joined: Mon Feb 17, 2003 8:54 pm

Post by bbadmin »

".*" is dangerous to use, because it can match nothing, so it can keep matching at the same end of line. Use ".+" instead.

Incidentally, if a replace command looks like it is looping, click the Cancel button on the dialog box, and then you'll be able to see what was going on.
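
To illustrate the point about ".*" matching nothing, here is a minimal sketch using Python's re module as a stand-in. It is not TextPad's regex engine, and the helper name spans is made up for this example. It lists the successive matches each pattern finds on the sample line used later in this thread.

```python
import re

line = "oh dear"

def spans(pattern, text):
    """Return (start, end, matched_text) for each successive match."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]

# ".*" first takes the whole line, then matches the empty string at its end.
star_matches = spans(r".*", line)
# ".+" needs at least one character, so there is no second, empty match.
plus_matches = spans(r".+", line)

print(star_matches)  # [(0, 7, 'oh dear'), (7, 7, '')]
print(plus_matches)  # [(0, 7, 'oh dear')]
```

The extra empty match at offset 7 is the one that a repeated replace can keep hitting if the editor does not move past it.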
Erik G D
Posts: 4
Joined: Tue Jan 29, 2008 9:37 am

Post by Erik G D »

The bug has survived into 7.0.7.
Inserting the line "oh dear", then replacing regexp "(.*)" with "o" results (after a while) in a long line starting with "oooooooo...".
Even though this problem can be circumvented by changing the regexp, I still consider it a bug, one that was introduced by the rework of the regexp mechanism.
The equivalent on Linux gives:

$ echo "oh dear" | sed 's/.*/o/g'
o
$
MudGuard
Posts: 1295
Joined: Sun Mar 02, 2003 10:15 pm
Location: Munich, Germany

Post by MudGuard »

bbadmin wrote: ".*" is dangerous to use, because it can match nothing, so it can keep matching at the same end of line. Use ".+" instead.

But usually, * is greedy, so it should match as much as possible, not nothing ...
bbadmin
Site Admin
Posts: 877
Joined: Mon Feb 17, 2003 8:54 pm

Post by bbadmin »

MudGuard wrote: But usually, * is greedy, so it should match as much as possible, not nothing ...

By definition, "*" matches 0 or more occurrences of the preceding subexpression. However, I concede the point that TextPad 7 behaves differently from earlier versions in this respect, so we'll look into reinstating the old behaviour in the next release.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

... and . (dot) doesn't match a newline.

What is happening is that at the first attempt, .* matches as early as possible and as much as possible, so it matches from the current position to the next newline character, that is, to the end of the line. The current position is then at the end of the line. So the next attempt matches similarly, as early and as much as possible, from where it is (the end of the line) to the next newline character (the same place). Thus, it matches the empty string at that position. To see this, try replacing (.*) with [$1].

Now try replacing ([^e]*) with [$1]. The behaviour is entirely analogous. Thus TextPad 7 is consistent in this respect.
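
For readers without TextPad 7 to hand, the two experiments can be approximated in Python. This is only a sketch: it assumes Python 3.7 or later (earlier versions treated empty matches in re.sub differently), Python's re is not TextPad's recogniser, and Python writes the captured text as \1 in the replacement where TextPad 7 accepts $1.

```python
import re

line = "oh dear"

# Replace All of (.*) with [$1]: the whole line, then the empty
# string left at the end of the line.
whole_line = re.sub(r"(.*)", r"[\1]", line)

# Replace All of ([^e]*) with [$1]: the same pattern of a non-empty
# match followed by an empty one, this time also in front of the "e".
up_to_e = re.sub(r"([^e]*)", r"[\1]", line)

print(whole_line)  # [oh dear][]
print(up_to_e)     # [oh d][]e[ar][]
```

The empty brackets mark the empty matches. Python moves on after each one, so the replacement finishes instead of looping.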

In Old TextPad, replacing ([^e]*) with [\1] behaved exactly as replacing ([^e]*) with [$1] does in TextPad 7. So Old TextPad's behaviour when replacing (.*) was inconsistent with its behaviour when replacing ([^e]*). The reason for this inconsistency is perhaps that Old TextPad's regex recogniser is line-based and newlines are handled as a special case. TextPad 7 uses a different, more powerful, recogniser, which doesn't suffer from the limitations of the old line-based one.

(Note that sed is essentially line-based; TextPad isn't. So a comparison between TextPad and sed is not entirely relevant.)

There are four possible approaches:
1. Retain the existing consistent behaviour.
2. Reinstate Old TextPad's inconsistent behaviour.
3. Implement a new consistent behaviour where replacing both (.*) and ([^e]*) skips empty matches during repeated replacements.
4. Something else.
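
Approach 3 could look something like the following Python sketch. It is purely illustrative, not a proposal for how TextPad is or should be implemented: the function name replace_all_skip_empty is invented here, and the rule it applies (ignore an empty match that starts exactly where the previous match ended) is one assumed reading of "skips empty matches".

```python
import re

def replace_all_skip_empty(pattern, repl, text):
    """Replace every match, ignoring an empty match that begins
    exactly where the previous match ended."""
    out = []
    copied_to = 0          # how much of text has been emitted so far
    prev_match_end = None  # end offset of the last match we replaced
    for m in re.finditer(pattern, text):
        is_empty = m.start() == m.end()
        if is_empty and m.start() == prev_match_end:
            continue  # the trailing empty match: skip it
        out.append(text[copied_to:m.start()])  # unmatched text in between
        out.append(m.expand(repl))             # replacement, with \1 expanded
        copied_to = m.end()
        prev_match_end = m.end()
    out.append(text[copied_to:])
    return "".join(out)

print(replace_all_skip_empty(r".*", "o", "oh dear"))             # o
print(replace_all_skip_empty(r"([^e]*)", r"[\1]", "oh dear"))    # [oh d]e[ar]
```

Under this rule (.*) and ([^e]*) are treated alike, and the first call gives the same result as the sed command quoted earlier in the thread.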

There's another weirdness. Try replacing (.*) with [$1], one replacement at a time, using Replace Next. The first time, TextPad replaces the empty string at the current position. The next time, it replaces the rest of the line. Old TextPad did the same thing when replacing (.*) with [\1].
sosimple
Posts: 30
Joined: Sat May 16, 2009 6:54 am

Post by sosimple »

Hi ben
ben_josephs wrote: ... Now try replacing ([^e]*) with [$1]. The behaviour is entirely analogous. Thus TextPad 7 is consistent in this respect.

In Old TextPad, replacing ([^e]*) with [\1] behaved exactly as replacing ([^e]*) with [$1] does in TextPad 7. ...

I haven't tried TextPad 7 yet (I will, soon)...

I assume TextPad 7 uses a $ as a lead-in for back-references, like $1$2, whereas previous versions of TextPad used a \ as a lead-in for back-references, like \1\2.

In your post, is there a significance to using the letter e in ([^e]*), or is it just used as an example of a character that isn't present in the searched text?

Also, is there a significance to using square-brackets around $1 like [$1], or were the brackets used for emphasis?
sosimple
Posts: 30
Joined: Sat May 16, 2009 6:54 am

Post by sosimple »

Erik G D wrote: ... replace "\(.*\)" with "some text including \1;" did work.
Now (version 7.0.4) the corresponding regexp, "(.*)", causes TP to hang ...

It's unclear if it's hanging when you click Replace, or Replace Next, or Replace All.

Either way, this does seem wrong.

With TextPad 5.4.2:
1) Successively clicking Replace begins each next search at the beginning of the same line, repeatedly replacing the same entire line, but does not hang.
2) Successively clicking Replace, then Find Next begins each next search at one character after the start of the previous replacement, repeatedly replacing the same entire line but skipping one additional character at the beginning each time, but does not hang.
3) Successively clicking Replace Next, TextPad first replaces the empty string at the current position (non-greedy behavior), then replaces the remainder of the line. Then TextPad alternates between replacing the empty string at the beginning of the next line and replacing the text on that line following the replacement text (or to the next line if that line is empty). And still, no hang.
4) Clicking Replace All makes one replacement on each line (including empty lines), and again, no hang. Here, each next search begins with the first (non-empty) character following the replacement text. So, the fact that the replacement includes the text in \1 (or $1) does not have any effect on further searches.
It sounds like the behavior in (4) is what you're expecting. Which of these ways are you doing search-and-replace in TextPad 7 when it hangs?
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

TextPad doesn't use $ for back-references in a regex: that's still \. For example \b(\w+)\s+\1\b matches repeated words.

But in a replacement expression (not a regex) it does use the Perlish $n to represent what matched the nth captured substring (that's not a back-reference). But you can still use \n for that.
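
The distinction can be tried out in Python, offered here only as a rough analogue: Python's re is a different engine from TextPad's, and it writes the captured text as \1 in the replacement string rather than $1. The sample sentence is made up for the example.

```python
import re

text = "this is is a test"

# \1 inside the regex is a back-reference: it must match the same
# text that group 1 matched, so the pattern finds a repeated word.
repeated_word = re.compile(r"\b(\w+)\s+\1\b")
found = repeated_word.search(text).group(0)

# \1 inside the replacement is not a back-reference: it just inserts
# what group 1 captured (TextPad 7 would also accept $1 here).
collapsed = repeated_word.sub(r"\1", text)

print(found)      # is is
print(collapsed)  # this is a test
```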

I used e as it's a character that is in the searched text ("oh dear"). It's to stop the regex matching the whole line, to make it easier to see what's happening. I used the brackets also to highlight what's happening.

Neither TextPad < 7 nor TextPad 7 behaves reasonably in all cases here. But the looping behaviour of Replace All in TextPad 7 is not unreasonable: it's doing precisely what the user asked it to do. The most you can say is that it's undesirable.

As has been said in this thread, elsewhere in these forums and across the web: if you use .* when you don't mean it you will come a cropper.