Some regexps which worked before now causes "eternal lo
Moderators: AmigoJack, bbadmin, helios, Bob Hansen, MudGuard
In earlier versions, replacing the whole contents of a line with something,
in this way: replace "\(.*\)" with "some text including \1;" did work.
Now (version 7.0.4) the corresponding regexp, "(.*)", causes TP to hang, eventually producing the error message (which btw cannot be copied and pasted): "Operation interrupted - it would make line 1 longer than (big number)".
I know I could rewrite the pattern to something like "(^..*)" to make it work, but the point is that it did work before, and I cannot see why the pattern should cause a recursive loop, because the match of the whole line should be preferred over the match of the first char.
(also, "eternal loops" should always be avoided.)
The bug has survived into 7.0.7
Inserting the line "oh dear", then replacing regexp "(.*)" with "o" results (after a while) in a long line starting with "oooooooo...".
Even though this problem can be circumvented by changing the regexp, I still consider it a bug, a bug that has been introduced after the rework of the regexp mechanism.
The equivalent in linux shows:
$ echo "oh dear" | sed 's/.*/o/g'
o
$
But usually, * is greedy, so it should match as much as possible, not nothing ...
bbadmin wrote: ".*" is dangerous to use, because it can match nothing, so it can keep matching at the same end of line. Use ".+" instead.
Incidentally, if a replace command looks like it is looping, click the Cancel button on the dialog box, and then you'll be able to see what was going on.
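For anyone who wants to compare ".*" and ".+" outside TextPad, here is a minimal Python sketch. It assumes Python 3.7 or later and uses Python's re module, which is not TextPad's regex engine, so it only illustrates the general point: ".*" can match the empty string left at the end of a line, while ".+" cannot.

```python
import re

line = "oh dear"

# ".*" first matches the whole line, then matches the empty string
# that remains at the end of the line, so two replacements are made.
star_result, star_count = re.subn(r".*", "o", line)

# ".+" needs at least one character, so the empty string at the end
# of the line is not a match and only one replacement is made.
plus_result, plus_count = re.subn(r".+", "o", line)

print(star_result, star_count)  # oo 2  (Python 3.7+)
print(plus_result, plus_count)  # o 1
```

Python stops after the single empty match, so it does not loop, but the extra "o" shows where the second match comes from.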
By definition, "*" matches 0 or more occurrences of the preceding subexpression. However, I concede the point that TextPad 7 behaves differently from earlier versions in this respect, so we'll look into reinstating the old behaviour in the next release.
MudGuard wrote: But usually, * is greedy, so it should match as much as possible, not nothing ...
... and . (dot) doesn't match a newline.
What is happening is that at the first attempt, .* matches as early as possible and as much as possible, so it matches from the current position up to the next newline character, that is, to the end of the line. The current position is then at the end of the line. So the next attempt matches similarly, as early and as much as possible, from where it is (the end of the line) up to the next newline character (the same place). Thus, it matches the empty string at that position. To see this, try replacing (.*) with [$1].
Now try replacing ([^e]*) with [$1]. The behaviour is entirely analogous. Thus TextPad 7 is consistent in this respect.
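The same sequence of matches can be reproduced in other engines. The sketch below uses Python's re module (Python 3.7 or later assumed), not TextPad's recogniser, so it is only an analogy for the explanation above. Note that Python writes the captured text as \1 in the replacement, where TextPad 7 also accepts $1. The helper name spans is just for illustration.

```python
import re

line = "oh dear"

def spans(pattern, text):
    """Return (start, end, matched_text) for each successive match."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]

# (.*) matches the whole line, then the empty string at the end of it.
dot_star = spans(r"(.*)", line)

# ([^e]*) stops at each "e", so an empty match also appears in front of the "e".
not_e = spans(r"([^e]*)", line)

print(dot_star)  # [(0, 7, 'oh dear'), (7, 7, '')]
print(not_e)     # [(0, 4, 'oh d'), (4, 4, ''), (5, 7, 'ar'), (7, 7, '')]

# Bracketing each match makes the empty matches visible as "[]".
bracketed_dot_star = re.sub(r"(.*)", r"[\1]", line)
bracketed_not_e = re.sub(r"([^e]*)", r"[\1]", line)

print(bracketed_dot_star)  # [oh dear][]
print(bracketed_not_e)     # [oh d][]e[ar][]
```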
In Old TextPad, replacing ([^e]*) with [\1] behaved exactly as replacing ([^e]*) with [$1] does in TextPad 7. So Old TextPad's behaviour when replacing (.*) was inconsistent with its behaviour when replacing ([^e]*). The reason for this inconsistency is perhaps that Old TextPad's regex recogniser is line-based and newlines are handled as a special case. TextPad 7 uses a different, more powerful, recogniser, which doesn't suffer from the limitations of the old line-based one.
(Note that sed is essentially line-based; TextPad isn't. So a comparison between TextPad and sed is not entirely relevant.)
There are four possible approaches:
1. Retain the existing consistent behaviour.
2. Reinstate Old TextPad's inconsistent behaviour.
3. Implement a new consistent behaviour where replacing both (.*) and ([^e]*) skips empty matches during repeated replacements.
4. Something else.
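To make the difference between approaches 1 and 3 concrete, here is a rough Python sketch. Both functions are hypothetical models written for this post, not TextPad's actual code: replace_all_naive is one plausible way a Replace All loop could end up growing a line of "o"s (the max_steps cap stands in for the editor's line-length limit), and replace_all_skip_empty ignores an empty match that starts exactly where the previous match ended. The replacement is treated as a literal string, and Python 3.7 or later is assumed.

```python
import re

def replace_all_naive(text, pattern, repl, max_steps=10):
    """Hypothetical model: search again right after each replacement.

    With ".*" the search keeps finding an empty match at the end of the
    line, so the line grows by one replacement per step until max_steps.
    """
    regex = re.compile(pattern)
    pos = 0
    for _ in range(max_steps):
        m = regex.search(text, pos)
        if m is None:
            break
        text = text[:m.start()] + repl + text[m.end():]
        pos = m.start() + len(repl)  # resume just after the inserted text
    return text

def replace_all_skip_empty(text, pattern, repl):
    """Sketch of approach 3: skip an empty match glued to the previous match."""
    out = []
    last = 0         # end of the text already copied to out
    prev_end = None  # end of the previous accepted match
    for m in re.finditer(pattern, text):
        if m.start() == m.end() and m.start() == prev_end:
            continue  # empty match right after the previous match: ignore it
        out.append(text[last:m.start()])
        out.append(repl)
        last = prev_end = m.end()
    out.append(text[last:])
    return "".join(out)

print(replace_all_naive("oh dear", r".*", "o", max_steps=5))  # ooooo
print(replace_all_skip_empty("oh dear", r".*", "o"))          # o
print(replace_all_skip_empty("oh dear", r"[^e]*", "o"))       # oeo
```

With the skip rule, both patterns are treated the same way, and the ".*" case gives the same result as the sed command quoted earlier in the thread.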
There's another weirdness. Try replacing (.*) with [$1], one replacement at a time, using Replace Next. The first time, TextPad replaces the empty string at the current position. The next time, it replaces the rest of the line. Old TextPad did the same thing when replacing (.*) with [\1].
Hi ben
I assume TextPad 7 uses $ as the lead-in for back-references, like $1$2, whereas previous versions of TextPad used \ as the lead-in, like \1\2.
In your post, is there a significance to using the letter e in ([^e]*), or is it just used as an example of a character that isn't present in the searched text?
Also, is there a significance to using square-brackets around $1 like [$1], or were the brackets used for emphasis?
I haven't tried TextPad 7 yet (I will, soon)...
ben_josephs wrote: ... Now try replacing ([^e]*) with [$1]. The behaviour is entirely analogous. Thus TextPad 7 is consistent in this respect.
In Old TextPad, replacing ([^e]*) with [\1] behaved exactly as replacing ([^e]*) with [$1] does in TextPad 7. ...
It's unclear if it's hanging when you click Replace, or Replace Next, or Replace All.
... replace "\(.*\)" with "some text including \1;" did work.
Now (version 7.0.4) the corresponding regexp, "(.*)", causes TP to hang ...
Either way, this does seem wrong.
With TextPad 5.4.2:
1) Successively clicking Replace begins each next search at the beginning of the same line, repeatedly replacing the same entire line, but does not hang.
2) Successively clicking Replace, then Find Next begins each next search at one character after the start of the previous replacement, repeatedly replacing the same entire line but skipping one additional character at the beginning each time, but does not hang.
3) Successively clicking Replace Next, TextPad first replaces the empty string at the current position (non-greedy behavior), then replaces the remainder of the line. Then TextPad alternates between replacing the empty string at the beginning of the next line and replacing the text on that line following the replacement text (or to the next line if that line is empty). And still, no hang.
4) Clicking Replace All makes one replacement on each line (including empty lines), and again, no hang. Here, each next search begins with the first (non-empty) character following the replacement text. So, the fact that the replacement includes the text in \1 (or $1) does not have any effect on further searches.
TextPad doesn't use $ for back-references in a regex: that's still \. For example \b(\w+)\s+\1\b matches repeated words.
But in a replacement expression (not a regex) it does use the Perlish $n to represent what matched the nth captured substring (that's not a back-reference). But you can still use \n for that.
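The distinction can be tried out in Python as a rough analogy (again, Python's re module, not TextPad). Inside the regex, \1 is a back-reference to whatever group 1 matched. In the replacement, Python writes the captured text as \1 as well; it does not accept the $1 form that TextPad 7 supports. The sample text is made up for illustration.

```python
import re

text = "this is is a test"

# Inside the regex, \1 is a back-reference: it must match the same
# text that group 1 captured, so this finds a repeated word.
repeated_word = re.compile(r"\b(\w+)\s+\1\b")

match = repeated_word.search(text)
print(match.group(0))  # is is

# In the replacement string, \1 stands for the captured text, so each
# repeated pair is collapsed to a single copy of the word.
collapsed = repeated_word.sub(r"\1", text)
print(collapsed)  # this is a test
```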
I used e as it's a character that is in the searched text ("oh dear"). It's to stop the regex matching the whole line, to make it easier to see what's happening. I used the brackets also to highlight what's happening.
Neither TextPad < 7 nor TextPad 7 behaves reasonably in all cases here. But the looping behaviour of Replace All in TextPad 7 is not unreasonable: it's doing precisely what the user asked it to do. The most you can say is that it's undesirable.
As has been said in this thread, elsewhere in these forums and across the web: if you use .* when you don't mean it you will come a cropper.