3 regular expression bugs matching '$' in PC vs UNIX files?
Posted: Mon Dec 08, 2014 5:51 pm
by brookh
(Please ignore this first example and see the corrected example in the 4th post, from Wed Dec 24, 2014 4:13 pm)
Create the following 5 line file (last line is empty):
And save 2 copies of it, one as file type "PC" and one as "UNIX".
BUG #1
Use this regex to find all non-empty lines NOT ending with '/':
The PC file matches only line 2 (as it should). But the UNIX file (incorrectly) matches both lines 2 & 4.
BUG #2
Use this regex to find all lines (incl. empty) not ending with '/':
Now the PC file also (incorrectly) matches both lines 2 and 4.
BUG #3?
Using the same regex as #2 to find all lines (incl. empty) not ending with '/':
Neither the PC file nor the UNIX file matches line 3 or line 5 (the empty lines, with and without a terminating line-feed). I'm pretty sure they should.
Posted: Mon Dec 08, 2014 8:00 pm
by MudGuard
brookh wrote: BUG #2
Use this regex to find all lines (incl. empty) not ending with '/':
[^/] matches exactly one character that is not a /
Thus, the expression [^/]$ cannot match empty lines, as the match must contain at least that one character.
To find lines (incl. empty ones) that do not end with /, you need to use negative lookaround.
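As a rough illustration of the lookaround idea, here is a minimal sketch using Python's `re` module as a stand-in for the editor's engine. The sample text is an assumption (a Unix-format file whose lines are `line1/`, `line2`, an empty line, `line4/`, and an empty last line), and `(?<!/)$` is just one possible negative lookbehind, not necessarily the pattern intended above.

```python
import re

# Assumed sample: LF line endings; line 5 is the empty text after the last LF.
text = "line1/\nline2\n\nline4/\n"

# (?<!/)$ succeeds at an end-of-line position that is NOT preceded by '/'.
# It consumes no characters, so it also succeeds on empty lines.
pattern = re.compile(r"(?<!/)$", re.MULTILINE)

# Convert each match offset to a 1-based line number.
lines = [text.count("\n", 0, m.start()) + 1 for m in pattern.finditer(text)]
print(lines)  # [2, 3, 5]
```

Because the assertion is zero-width, nothing is selected; the match only marks the end of each qualifying line.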
brookh wrote: BUG #3?
Using the same regex as #2 to find all lines (incl. empty) not ending with '/':
Same error in your assumption, as it is the same regex.
Posted: Mon Dec 08, 2014 10:44 pm
by ben_josephs
And your BUG1 is not a bug.
In a Unix format file line 4 ends with / LF, which is matched by .[^/]$ because [^/] matches CR and LF.
In a PC format file line 4 ends with / CR LF, which is not matched by .[^/]$ because dot doesn't match CR or LF.
Try .[^/\n]$
Posted: Wed Dec 24, 2014 4:13 pm
by brookh
MudGuard wrote: [^/] matches exactly one character that is not a /
I know. I first found the bug because I was using [~/]$ and it kept matching line 4. It was only after I discovered that adding a . in front of it gave correct results on PC that I got confused and started thinking that maybe [~/] could match an empty string.
Let's start over. Create the following 5 line file (last line is empty):
And save 2 copies of it, one as file type "PC" and one as "UNIX".
Testcase #1
Use this regex to find all non-empty lines NOT ending with '/'. This should only match line 2:
[^/]$
Both the PC and Unix files match line 2 (as they should), but they ALSO match line 4 (which they should not).
Testcase #2
Now use this regex, which should match any line of length 2 or greater not ending in '/'. Again, this should only match line 2:
.[^/]$
The PC file now only matches line 2 (as it should), but the Unix file still also matches line 4 (still wrong). That this testcase works differently on PC vs Unix files may provide some clue to understanding the bug.
Posted: Sat Dec 27, 2014 4:55 pm
by ben_josephs
[~/] does not match one character that is not / . It matches one character that is either ~ or / .
The expression that matches one character that is not / is [^/] . It does not match the empty string. It does match CR; it does match LF.
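The difference between the two classes can be checked quickly. This is a small sketch in Python's `re` module, used here only as a stand-in for the editor's engine; the sample string `a~b/c` and the variable names are made up for illustration.

```python
import re

sample = "a~b/c"

# [~/] is an ordinary class: it matches a literal '~' or a literal '/'.
tilde_or_slash = re.findall(r"[~/]", sample)
print(tilde_or_slash)  # ['~', '/']

# [^/] is a negated class: it matches any single character except '/'.
not_slash = re.findall(r"[^/]", sample)
print(not_slash)  # ['a', '~', 'b', 'c']

# A negated class still has to consume exactly one character,
# so it cannot match the empty string...
matches_empty = re.fullmatch(r"[^/]", "") is not None
# ...but CR and LF both count as "not '/'".
matches_cr = re.fullmatch(r"[^/]", "\r") is not None
matches_lf = re.fullmatch(r"[^/]", "\n") is not None
print(matches_empty, matches_cr, matches_lf)  # False True True
```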
I'm sorry you weren't able to follow my earlier explanation. Here's another attempt:
                            line 1                        line 2            line 3              line 4              EoF
                ------------------------------  --------------------------  ------  ------------------------------  ---
Text (Windows)  l   i   n   e   1   /   CR  LF  l   i   n   e   2   CR  LF  CR  LF  l   i   n   e   4   /   CR  LF
Matches of:
$       (5)                             $                           $       $                               $       $
[^/]$   (3)                                                     [^/]$   [^/]$                                   [^/]$
.[^/]$  (1)                                                 .   [^/]$
                          line 1                    line 2          line 3            line 4            EoF
                --------------------------  ----------------------  ------  --------------------------  ---
Text (Unix)     l   i   n   e   1   /   LF  l   i   n   e   2   LF  LF      l   i   n   e   4   /   LF
Matches of:
$       (5)                             $                       $   $                               $   $
[^/]$   (3)                                                 [^/]$                                   [^/]$
                                                                [^/]$
.[^/]$  (2)                                             .   [^/]$                               .   [^/]$
As you can see, there is no bug.
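For anyone who wants to reproduce the Unix table outside the editor, the following is a minimal sketch in Python's `re` module. It assumes the file content is `line1/`, `line2`, an empty line, `line4/`, and an empty last line, and the helper name `matches` is invented here. Only the Unix case is covered: in Python, `.` excludes LF but not CR, and `$` does not treat CR LF as a unit, so the PC table would not come out the same way.

```python
import re

# Assumed Unix-format sample corresponding to the second table.
unix_text = "line1/\nline2\n\nline4/\n"

def matches(pattern, text):
    """Return the text of every non-overlapping match, scanning left to right."""
    return [m.group() for m in re.finditer(pattern, text, re.MULTILINE)]

# Expected match counts, in order: 5, 3, 2 (as in the table).
for pattern in (r"$", r"[^/]$", r".[^/]$"):
    found = matches(pattern, unix_text)
    print(pattern, len(found), found)
```

The second and third `[^/]$` matches are bare LF characters, and the second `.[^/]$` match is the `/` of line 4 followed by its LF, which is why line 4 appears to match.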
Posted: Sat Dec 27, 2014 6:02 pm
by brookh
([~/] was obviously a typo and was supposed to be [^/])
I see. I must have been assuming either that the input text would be constrained to a single line at a time, or that the [^/] expression would not match newlines, or both. I must've adopted that misconception because . does not match newline characters, so an expression like .+ is constrained to a single line even without anchors. With . you have to explicitly include \n in the pattern to match across lines, whereas a negated class like [^/] matches newlines unless you exclude them.
The correct expression to only match non-empty lines not ending with / would therefore be [^/\n]$, and that does indeed work correctly.
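A quick check of that conclusion, sketched in Python's `re` module rather than in the editor itself. The sample text is assumed (`line1/`, `line2`, an empty line, `line4/`, an empty last line), and the CR LF variant at the end is a hypothetical adaptation for Python only, where CR has to be excluded by hand.

```python
import re

# Assumed Unix-format sample.
unix_text = "line1/\nline2\n\nline4/\n"

# Excluding \n from the class stops it from consuming a line terminator,
# so empty lines and the LF after "line4/" no longer produce matches.
pattern = re.compile(r"[^/\n]$", re.MULTILINE)
hits = [(unix_text.count("\n", 0, m.start()) + 1, m.group())
        for m in pattern.finditer(unix_text)]
print(hits)  # [(2, '2')]

# Hypothetical variant for CR LF text in Python: also exclude \r from the
# class and allow an optional \r before the end of the line.
pc_text = unix_text.replace("\n", "\r\n")
pc_hits = re.findall(r"[^/\r\n](?=\r?$)", pc_text, re.MULTILINE)
print(pc_hits)  # ['2']
```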
Thank you for helping me understand that!