Bug in regular-expression replacement

lrichardlewis · Post by **lrichardlewis** » Mon Jun 23, 2014 8:21 pm

I just discovered a bug. The window contains just the two bytes "CD". The regular expression is the four-byte string "A*B*". The replacement expression is the two-byte string "yz". When I activate Replace All, TP goes bananas. It tries to insert an infinite number of "yz" before the "C"; i.e., at the beginning of the window. It does this regardless of the position of the cursor.

I use version 6.1.3 of TextPad.

ben_josephs · Post by **ben_josephs** » Mon Jun 23, 2014 9:55 pm

TextPad is doing what you asked it to do.

The regex "A*" matches any number of "A"s, including none of them. That is, it matches the empty string. "A*B*" also matches the empty string. And a regex that matches the empty string matches everywhere.

Starting at the beginning of the document, TextPad searches for the first match of the regex "A*B*". It finds a match, namely the empty string, right where it is, immediately in front of the text "CD", and replaces what is matched, the empty string, with the replacement "yz". Now, starting immediately after the inserted string, TextPad searches for the first match of the regex "A*B*". It finds a match, namely the empty string, right where it is, immediately in front of the text "CD", and replaces what is matched, the empty string, with the replacement "yz". And so on ad infinitum.

In TextPad 6 if you repeatedly Replace Next you will see the behaviour that perhaps you expected. Thus the behaviour of Replace Next is inconsistent with the behaviour of Replace All.

In TextPad 7 if you repeatedly Replace Next you will see the behaviour you have described unfold step by step. Thus the behaviour of Replace Next is consistent with the behaviour of Replace All.

Note: some editors do not behave as TextPad does: they treat replacement of the empty string as a special case and move forward one character after each such replacment. Whether you think this is a good thing depends on your point of view. On the one hand it is possible that it is doing what the user intended. On the other hand it is possible that it is not doing what the user intended, and it is certain that it is not doing what the user requested, and it is not consistent with the behaviour of replacements of non-empty strings.

Moral: don't use * unless you really mean it.

What are you trying to do?

lrichardlewis · Post by **lrichardlewis** » Mon Jun 23, 2014 11:02 pm

I was testing the logic to see whether it would fail. This is part of a current project to develop a regular-expression processor for a program written in a language that lacks one. At present, I have almost everything except alternation and replacement.

I understand that "A*" matches the null string, including the null string at the beginning off the text. I called that behavior a bug because I thought the text cursor would advance after the replacement. But, even if it did advance, the result would be unpleasant. I think it would match (and replace) each null string between text characters.

I appreciate your explanation and believe I was mistaken in calling the behavior a bug.

ben_josephs · Post by **ben_josephs** » Tue Jun 24, 2014 7:43 am

What you're doing has been done many times before, of course, as a feature of many languages and as stand-alone libraries. (I'm sure you're aware of the list at http://en.wikipedia.org/wiki/Comparison ... on_engines.)

So I suppose you're doing it for fun. But making it efficient is hard.

lrichardlewis · Post by **lrichardlewis** » Wed Jun 25, 2014 2:13 am

Thanks for the link. It has a lot of info and interesting comparisons. But none are in the language of the program (a non-linear optimization suite) I am currently working on. That program is a conversion of an earlier program from a language that provides RX's to a language that doesn't. Unfortunately, the target language (Fortran) is apparently not listed in the link. I am so far along now that I want to continue on my current path, which means that I would use only ideas or source code extracts from any existing implementation.

I am having fun, but also combining the fun with developing something that will get a lot of use.

Regarding efficiency, I believe the rewriting I have in mind to add alternation and replacement should greatly improve RX timing.