
Replace is destroying line endings

Posted: Thu Apr 11, 2013 4:49 pm
by geoffreykidd
I've used the following for years:

[^“”] -> (null)

It still works in 7, but it's leaving me with ONE line that's all “” pairs.

This is the first stage of a process I use to find unbalanced quote marks in dialogue, and it kills the process, which is: 1. Kill anything not a quotemark. 2. Kill pairs of quotemarks. 3. Bookmark the lines with individual (unmatched) quotemarks. Undo 2. Undo 1.

Manually check the bookmarked paragraphs. (weed out false positives)

How do I avoid killing line endings? I don't DARE update my portable (working) copy until this works.

HELP!

Never mind. Sorry to have bothered anybody.

Posted: Thu Apr 11, 2013 5:16 pm
by geoffreykidd
Proper expression turned out to be: [^“”\r\n]

and everything else worked perfectly.

G_d bless Regex Buddy and Textpad!
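
For readers following along outside TextPad, the difference between the two character classes, and the three-stage process from the first post, can be sketched in Python. This is only an illustration: the sample text is invented, Python's re module stands in for TextPad's engine, and stage 3 is simplified to "flag any line that still contains a quote mark".

```python
import re

# Invented sample with Windows (CRLF) line endings.
text = 'He said “hi”.\r\n“Open only\r\nplain\r\n'

# Old class: the negated set also matches \r and \n,
# so every line break is deleted along with the other text.
collapsed = re.sub(r'[^“”]', '', text)

# Corrected class: \r and \n are excluded from the match and survive.
stage1 = re.sub(r'[^“”\r\n]', '', text)  # 1. kill anything that is not a quote mark
stage2 = stage1.replace('“”', '')         # 2. kill adjacent open/close pairs

# 3. any line that still holds a quote mark needs a manual check
flagged = [n for n, line in enumerate(stage2.splitlines(), start=1) if line]

print(repr(collapsed))  # one line holding only quote marks
print(flagged)          # 1-based numbers of the lines to check
```

With the old class everything collapses into a single line of quote marks, which is the symptom described above. With the corrected class the line structure is kept, so the leftover quote mark can still be traced back to its line.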

Posted: Thu Apr 11, 2013 7:30 pm
by ben_josephs
Yes, in TextPad 7, [^...] matches any single character, including a newline character, that [...] doesn't match.
And \n now matches only linefeeds, not generic newlines, so you have to specify linefeeds and carriage returns separately.
(If it's not in a character set [...] you can use \R for a generic newline.)

You can do this in fewer steps:

To match all lines containing no unbalanced quotes:
^([^“”\r\n]|“[^“”\r\n]*”)*$

To match all lines containing an unbalanced quote:
(?!^([^“”\r\n]|“[^“”\r\n]*”)*$)^.+
(No doubt something simpler is possible.)

Edit: Removed redundant parentheses in second regex.
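
As a rough cross-check outside TextPad, the first regex can be applied line by line in Python. This is a sketch under assumptions: the helper name unbalanced_lines and the sample text are made up for illustration, re.fullmatch replaces the ^ and $ anchors, and the group is made non-capturing. It is not a description of how TextPad evaluates the expression.

```python
import re

# Body of the first regex above; fullmatch supplies the ^...$ anchoring.
BALANCED = re.compile(r'(?:[^“”\r\n]|“[^“”\r\n]*”)*')

def unbalanced_lines(text):
    """Return 1-based numbers of lines whose curly quotes do not pair up."""
    return [n for n, line in enumerate(text.splitlines(), start=1)
            if not BALANCED.fullmatch(line)]

# Hypothetical sample: lines 2, 4 and 6 each have a problem.
sample = (
    'He said “hi”.\n'
    '“Open only\n'
    'plain\n'
    'close only”\n'
    '“a” and “b”\n'
    '“nested “x””\n'
)
print(unbalanced_lines(sample))
```

A line counts as balanced only if it is a sequence of non-quote characters and complete “...” spans, so a stray closing quote and a nested opening quote are both reported.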

Posted: Thu Apr 11, 2013 7:58 pm
by geoffreykidd
I tried both regexes on my current project, and the first one successfully bookmarked everything except the lines that did need checking. All I needed to do was invert all bookmarks and F2 my way through the text.

It was beautiful! The bookmarked lines were a match-for-match with the lines my old technique left bookmarked. This also means I can now create a "bookmark unbalanced quotes" macro which will do both steps in one pass.

I still need to check the results manually because there's a typesetting convention that says next-paragraph-same-speaker ends without a closing quote. FYI, the first time I tried the technique back in 2005, I got 73 hits of which 67 were false positives because the characters tended to be long-winded. :) But those six true results were worth their weight in platinum to me.

The second regex (used on the same file) didn't mark anything, including the lines that did indeed have unbalanced quotes. If I can find the time, I may stuff it into RegEx Buddy for debugging.

Either way, I now have a fix for a vital tool in my proofreading workshop. I can't thank you enough for the help.

Posted: Thu Apr 11, 2013 8:07 pm
by ben_josephs
Please post an example of a line with unbalanced quotes that my second regex didn't match.

Posted: Thu Apr 11, 2013 8:26 pm
by geoffreykidd
This is interesting. Regex2 worked on a couple of short files and one humungous one (6000-odd lines), going its merry way quickly and efficiently.

However, copying and pasting plaintext from my current project and trying it there seemed to freeze the expression dead.

Possible character-encoding problem?

Posted: Thu Apr 11, 2013 11:27 pm
by geoffreykidd
I ran the second regex past RegexBuddy and got the following:

(?!^([^“”\r\n]|“[^“”\r\n]*”)*$)^.+

Options: ^ and $ match at line breaks

A POSIX Extended RE does not support lookaround «(?!^([^“”\r\n]|“[^“”\r\n]*”)*$)»
Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match the regular expression below and capture its match into backreference number 1 «([^“”\r\n]|“[^“”\r\n]*”)*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^“”\r\n]»
Match a single character NOT present in the list below «[^“”\r\n]»
One of the characters ““”” «“”»
A carriage return character «\r»
A line feed character «\n»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «“[^“”\r\n]*”»
Match the character ““” literally «“»
Match a single character NOT present in the list below «[^“”\r\n]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
One of the characters ““”” «“”»
A carriage return character «\r»
A line feed character «\n»
Match the character “”” literally «”»
Assert position at the end of a line (at the end of the string or before a line break character) «$»
Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match any single character that is not a line break character «.+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»

Posted: Fri Apr 12, 2013 8:29 am
by ben_josephs
TextPad 7 implements Perl-style regular expressions, not POSIX extended regular expressions. Perl-style regular expressions are more powerful, and they do support look ahead and look behind. Look in TextPad's help under Reference Information | Regular Expressions.
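
Python's re module is another Perl-style engine that accepts negative lookahead, so the second regex can be tried there as a sanity check. The sketch below assumes line endings have been normalized to \n and uses re.MULTILINE so that ^ and $ match at line breaks. The sample text is invented, and TextPad's own engine may differ in details.

```python
import re

# Second regex from the thread, with the inner group made non-capturing.
# re.MULTILINE makes ^ and $ match at each line break.
UNBALANCED = re.compile(
    r'(?!^(?:[^“”\r\n]|“[^“”\r\n]*”)*$)^.+',
    re.MULTILINE,
)

# Invented sample, assumed to use \n-only line endings.
text = 'He said “hi”.\n“Open only\nplain\n'

hits = [m.group(0) for m in UNBALANCED.finditer(text)]
print(hits)
```

The lookahead rejects any line start from which the "balanced" pattern can reach the end of the line, so ^.+ only matches the remaining lines.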

Posted: Fri Apr 12, 2013 3:12 pm
by geoffreykidd
Thank you. Will do.

Posted: Fri Apr 12, 2013 4:08 pm
by geoffreykidd
Re-tested Regex2 in Regex Buddy against my test file and it located the unbalanced lines fine. It also worked in Textpad with a copy of one of Horatio Alger's novels I got from Project Gutenberg. However, loading the test file into Textpad and running regex2 with "find" or "find next", I get "search passed end of file."

I'm beginning to think there's something funky about the file that may be causing the recognizer to go crazy. Could I send you a copy of the file for examination? If so, to whom would I address it, please? Thank you.

Posted: Fri Apr 12, 2013 5:00 pm
by ben_josephs
You could put it somewhere on the web and post a link to it.

Posted: Fri Apr 12, 2013 5:51 pm
by geoffreykidd
I've sent the file by a submit form on the main site, since its contents are somewhat confidential.

Posted: Fri Apr 12, 2013 6:23 pm
by ben_josephs
Do you mean you've sent it to Helios? I am nothing to do with Helios, so I won't see it.

Posted: Fri Apr 12, 2013 6:25 pm
by geoffreykidd
I thought you were one of their support people, so...

Posted: Sat Apr 13, 2013 2:44 am
by ak47wong
The only accounts associated with Helios are bbadmin and helios. Everyone else is just a regular user.