3 regular expression bugs matching '$' in PC vs UNIX files?

brookh
Posts: 11
Joined: Sun Oct 29, 2006 9:26 am

Post by brookh »

(Please ignore this first example and see the corrected example in the 4th post from Wed Dec 24, 2014 10:13 am)
Create the following 5 line file (last line is empty):

Code: Select all

line 1/
line 2

line 4/

And save 2 copies of it, one as file type "PC" and one as "UNIX".

BUG #1
Use this regex to find all non-empty lines NOT ending with '/':

Code: Select all

.[^/]$
The PC file matches only line 2 (as it should). But the UNIX file (incorrectly) matches both lines 2 & 4.

BUG #2
Use this regex to find all lines (incl. empty) not ending with '/':

Code: Select all

[^/]$
Now the PC file also (incorrectly) matches both lines 2 and 4.

BUG #3?
Using the same regex as #2 to find all lines (incl. empty) not ending with '/':

Code: Select all

[^/]$
Neither the PC nor the UNIX file matches line 3 or line 5 (empty lines with and without a terminating line feed). I'm pretty sure they should?
Last edited by brookh on Wed Dec 24, 2014 4:51 pm, edited 1 time in total.
MudGuard
Posts: 1295
Joined: Sun Mar 02, 2003 10:15 pm
Location: Munich, Germany

Post by MudGuard »

brookh wrote:BUG #2
Use this regex to find all lines (incl. empty) not ending with '/':

Code: Select all

[^/]$
[^/] matches exactly one character that is not a /
Thus, the expression [^/]$ cannot match empty lines, as the match must contain at least that one character.

To find lines (incl. empty ones) that do not end with /, you need to use negative lookaround.
brookh wrote:BUG #3?
Using the same regex as #2 to find all lines (incl. empty) not ending with '/':

Code: Select all

[^/]$
Same error in your assumption, as it is the same regex.
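
To illustrate the negative lookaround suggested above, here is a minimal sketch using Python's `re` module as a stand-in. The pattern `(?<!/)$` and the helper name `lines_not_ending_with_slash` are illustrative assumptions; whether TextPad accepts the same lookbehind syntax depends on its regex engine and is not confirmed in this thread.

```python
import re

# The thread's 5-line test file with LF line endings (last line is empty).
TEXT = "line 1/\nline 2\n\nline 4/\n"

def lines_not_ending_with_slash(text):
    """Return 1-based numbers of lines (including empty ones) not ending in '/'."""
    # (?<!/)$ consumes no characters: it only asserts that the position is a
    # line end and that the character just before it is not '/'.
    # Because nothing is consumed, empty lines can match too.
    return [text.count("\n", 0, m.start()) + 1
            for m in re.finditer(r"(?<!/)$", text, re.MULTILINE)]

print(lines_not_ending_with_slash(TEXT))  # [2, 3, 5]
```

Unlike `[^/]$`, this pattern does not need a character to consume, which is why the empty lines 3 and 5 are reported.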
ben_josephs
Posts: 2460
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

And your BUG1 is not a bug.

In a Unix format file line 4 ends with

Code: Select all

  /     LF    EoF
which is matched by

Code: Select all

  .     [^/]  $
because [^/] matches CR and LF.

In a PC format file line 4 ends with

Code: Select all

  /     CR    LF    EoF
which is not matched by

Code: Select all

        .     [^/]  $
because dot doesn't match CR or LF.

Try
.[^/\n]$
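
The UNIX-format case can be reproduced outside TextPad. The sketch below uses Python's `re` module, which is an assumption of this example and not TextPad's engine; in particular, Python's dot also matches CR, so the PC-format behaviour described above is not reproduced here. The helper name `matching_lines` is hypothetical.

```python
import re

# The thread's test file in UNIX format: LF line endings, empty last line.
UNIX_TEXT = "line 1/\nline 2\n\nline 4/\n"

def matching_lines(pattern, text):
    """Return the 1-based line number on which each match starts."""
    return [text.count("\n", 0, m.start()) + 1
            for m in re.finditer(pattern, text, re.MULTILINE)]

# '/' is matched by the dot, LF by [^/], and $ matches at end of file,
# so line 4 is reported as well as line 2.
print(matching_lines(r".[^/]$", UNIX_TEXT))    # [2, 4]

# Excluding \n from the class stops it from consuming the line terminator.
print(matching_lines(r".[^/\n]$", UNIX_TEXT))  # [2]
```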
brookh
Posts: 11
Joined: Sun Oct 29, 2006 9:26 am

Post by brookh »

MudGuard wrote:[^/] matches exactly one character that is not a /

I know. I first found the bug because I was using [~/]$ and it kept matching line 4. It was only after I discovered that adding a . in front of it gave correct results on PC that I got confused and started thinking that maybe [~/] could match an empty string.

Let's start over. Create the following 5 line file (last line is empty):

Code: Select all

line 1/
line 2

line 4/

And save 2 copies of it, one as file type "PC" and one as "UNIX".

Testcase #1
Use this regex to find all non-empty lines NOT ending with '/'. This should only match line 2:

Code: Select all

[^/]$
Both the PC and Unix files will match line 2 (as they should), but they ALSO match line 4 (which they should not).

Testcase #2
Now use this regex, which should match any line of length 2 or greater, not ending in '/'. Again, this should only match line 2:

Code: Select all

.[^/]$
The PC file now only matches line 2 (as it should), but the Unix file still also matches line 4 (still wrong). That this testcase works differently on PC vs Unix files may provide some clue to understanding the bug.
ben_josephs
Posts: 2460
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

[~/] does not match one character that is not / . It matches one character that is either ~ or / .
The expression that matches one character that is not / is [^/] . It does not match the empty string. It does match CR; it does match LF.

I'm sorry you weren't able to follow my earlier explanation. Here's another attempt:

Code: Select all

                line 1                              line 2                          line 3  line 4                              EoF
                ----------------------------------  ------------------------------  ------  ----------------------------------  ---
Text (Windows)  l   i   n   e       1   /   CR  LF  l   i   n   e       2   CR  LF  CR  LF  l   i   n   e       4   /   CR  LF  
Matches of:
  $      (5)                                $                               $       $                                   $       $
  [^/]$  (3)                                                            [^/]$   [^/]$                                       [^/]$
  .[^/]$ (1)                                                        .   [^/]$                                                    

Code: Select all

                line 1                          line 2                      line 3  line 4                          EoF
                ------------------------------  --------------------------  ------  ------------------------------  ---
Text (Unix)     l   i   n   e       1   /   LF  l   i   n   e       2   LF  LF      l   i   n   e       4   /   LF  
Matches of:
  $      (5)                                $                           $   $                                   $   $
  [^/]$  (3)                                                        [^/]$                                       [^/]$
                                                                        [^/]$
  .[^/]$ (2)                                                    .   [^/]$                                   .   [^/]$
As you can see, there is no bug.
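
The match counts in the two tables can be checked with a small emulation. This is only a sketch in Python: the `DOT` and `EOL` patterns are assumptions reverse-engineered from the tables above (dot matches neither CR nor LF; `$` matches before a complete line terminator and at end of file), not TextPad's actual implementation, and all names are hypothetical.

```python
import re

# Assumed emulation of the behaviour shown in the tables:
DOT = r"[^\r\n]"                        # dot matches neither CR nor LF
EOL = r"(?:(?=\r\n)|(?<!\r)(?=\n)|\Z)"  # before CR LF, before a lone LF, or at EoF

PATTERNS = {
    "$": EOL,
    "[^/]$": "[^/]" + EOL,
    ".[^/]$": DOT + "[^/]" + EOL,
}

LINES = ["line 1/", "line 2", "", "line 4/", ""]
PC_TEXT = "\r\n".join(LINES)    # CR LF line endings
UNIX_TEXT = "\n".join(LINES)    # LF line endings

def counts(text):
    """Count the non-overlapping matches of each emulated pattern."""
    return {name: sum(1 for _ in re.finditer(pattern, text))
            for name, pattern in PATTERNS.items()}

print(counts(PC_TEXT))    # {'$': 5, '[^/]$': 3, '.[^/]$': 1}
print(counts(UNIX_TEXT))  # {'$': 5, '[^/]$': 3, '.[^/]$': 2}
```

Under these assumptions the counts agree with the figures in parentheses in both tables, including the single difference between the PC and Unix files for `.[^/]$`.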
brookh
Posts: 11
Joined: Sun Oct 29, 2006 9:26 am

Post by brookh »

([~/] was obviously a typo and was supposed to be [^/])

I see. I must have been assuming either that the input text would be constrained to a single line at a time, or that the [^/] expression would not match newlines, or both. I must've adopted that misconception because . does not match newline characters, so an expression like .+ is constrained to single lines even without anchors. With . you have to explicitly include \n in a pattern to match across lines, whereas a negated class such as [^/] matches newlines unless you exclude them.

The correct expression to only match non-empty lines not ending with / would therefore be [^/\n]$, and that does indeed work correctly.

Thank you for helping me understand that!
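
The difference between dot and a negated class described in this post can be seen in a short sketch. It uses Python's `re` module as an assumption for illustration; the variable names are hypothetical, and TextPad's engine may differ in other details.

```python
import re

text = "ab\ncd"

# Dot stops at the newline, so .+ yields one run per line.
dot_runs = re.findall(r".+", text)
print(dot_runs)     # ['ab', 'cd']

# A negated class matches anything not listed, including the newline,
# so [^/]+ runs straight across the line break.
class_runs = re.findall(r"[^/]+", text)
print(class_runs)   # ['ab\ncd']
```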