Exclude string, not just chars, from search

gdutoit · Post by **gdutoit** » Tue May 08, 2007 4:36 pm

I want to find everything between a start and end code, e.g.:
<startcode> - whole lot of text including angle brackets etc - <endcode>

Of course if I search for: <startcode>[^<endcode>]*<endcode>
it's going to stop at any of the characters between [ and ]

How do I make it exclude only the entire string <endcode>?

(I suspect the answer may be elementary, but I'm afraid I haven't been able to find/figure it out.)
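For anyone who wants to reproduce the problem described above, here is a minimal sketch. It assumes Python's `re` module, which is not the engine TextPad uses, and the sample strings are made up for illustration. The bracket expression `[^<endcode>]` excludes each of the characters `<`, `e`, `n`, `d`, `c`, `o` and `>` individually, so the match breaks as soon as the text between the codes contains any one of them.

```python
import re

# [^<endcode>] is a character class: it excludes the single characters
# <, e, n, d, c, o and >, not the string "<endcode>".
pattern = re.compile(r"<startcode>[^<endcode>]*<endcode>")

# Hypothetical sample text: none of the excluded characters occur in between.
digits_only = pattern.search("<startcode> 123 <endcode>")

# Hypothetical sample text: "some" contains "o", so the class stops there
# and the overall match fails.
with_letters = pattern.search("<startcode> some code <endcode>")

print(digits_only is not None)  # the digits-only text matches
print(with_letters is None)     # the text with letters does not
```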

ben_josephs · Post by **ben_josephs** » Tue May 08, 2007 5:19 pm

If the text to be matched spans an unknown number of lines you can't do that directly in TextPad, as it is incapable of matching text containing an arbitrary number of newlines.

If the text is all on one line, then <startcode>.*<endcode> will match everything from the first <startcode> to the last <endcode>, inclusive, because .* matches greedily; it matches as much as possible.
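A small sketch of the greedy behaviour, assuming Python's `re` module as a stand-in (TextPad's own engine is different, but `.*` is greedy in both). The sample line is only an illustration.

```python
import re

# Illustrative one-line text with two coded portions.
line = "<startcode> text <endcode> blah blah <startcode> more text <endcode>"

# .* is greedy: it takes as much as possible, so the match runs from the
# first <startcode> to the last <endcode> on the line.
greedy = re.search(r"<startcode>.*<endcode>", line)
greedy_text = greedy.group(0)

print(greedy_text == line)
```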

And <startcode>(.*)<endcode> will match the same thing, while capturing what is between <startcode> and <endcode> so that you can use that captured text in a replacement, where it is represented as \1.
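The capture-and-replace idea can be sketched as follows, again assuming Python's `re` module rather than TextPad itself; Python happens to use the same `\1` notation in replacements. The sample line is hypothetical.

```python
import re

# Hypothetical one-line sample with a single coded portion.
line = "before <startcode>some text<endcode> after"

pattern = re.compile(r"<startcode>(.*)<endcode>")

# Group 1 holds whatever sits between the two codes.
captured = pattern.search(line).group(1)

# In the replacement, \1 stands for the captured text, so this drops the
# codes and keeps what was between them.
stripped = pattern.sub(r"\1", line)

print(captured)
print(stripped)
```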

This assumes you are using POSIX regular expression syntax:

Configure | Preferences | Editor

[X] Use POSIX regular expression syntax

If you need to match text that spans an arbitrary number of newlines, you might try WildEdit (http://www.textpad.com/products/wildedit/), which uses a far more powerful regular expression engine than TextPad.

gdutoit · Post by **gdutoit** » Tue May 08, 2007 7:45 pm

However, if the line contains two instances of text between <startcode> and <endcode>, e.g.:

<startcode> text <endcode> blah blah <startcode> more text <endcode>

this greedy search will find the entire line instead of the individual coded portions. That's why I'm looking for a "not <endcode>" option in the search string, similar to, e.g.

{[^{}]*}

to find the individual portions in curly brackets in

{text} text {text} text

ben_josephs · Post by **ben_josephs** » Tue May 08, 2007 9:39 pm

You can't do this in TextPad, but you can in WildEdit, with a non-greedy repeat:
<startcode>.*?<endcode>

In WildEdit .*? matches non-greedily; it matches the shortest possible substring that allows the whole expression to match.
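A rough sketch of the non-greedy repeat, assuming Python's `re` module, which supports the same `.*?` syntax (WildEdit itself is not involved here). The `re.DOTALL` flag is a Python-specific detail that lets `.` match newlines, and the multi-line sample is made up.

```python
import re

line = "<startcode> text <endcode> blah blah <startcode> more text <endcode>"

# .*? is non-greedy: each match stops at the nearest <endcode>.
lazy_matches = re.findall(r"<startcode>.*?<endcode>", line)

# Hypothetical text whose first coded portion spans two lines.
multiline = "<startcode>line one\nline two<endcode> x <startcode>b<endcode>"

# In Python, . does not match a newline unless re.DOTALL is given.
multiline_matches = re.findall(r"<startcode>.*?<endcode>", multiline, flags=re.DOTALL)

print(lazy_matches)
print(multiline_matches)
```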

gdutoit · Post by **gdutoit** » Thu May 10, 2007 7:17 am

Thanks man, that will certainly save some frustration.

(I'm surprised, though, that an option for non-greedy search or something along the lines of [^"string"]*, where everything between quotes is excluded, isn't standard fare in REs.)

ben_josephs · Post by **ben_josephs** » Thu May 10, 2007 12:35 pm

They are available in many recent regular expression recognisers. I showed you a non-greedy quantifier (*?). To find a match that doesn't contain text that matches a particular regular subexpression you can use negative lookahead assertions ((?!...)). For example, to solve your problem:
<startcode>(?:(?!<startcode>|<endcode>).)*<endcode>
(This also copes with nested occurrences of <startcode>...<endcode>: it matches the innermost pair rather than running from an outer <startcode> to the first <endcode>.)

Both of these constructs are available in WildEdit.
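To see how the lookahead version differs from the non-greedy one, here is a sketch assuming Python's `re` module, which accepts the same `(?:...)` and `(?!...)` syntax. The nested sample string is invented for the comparison.

```python
import re

# At each position the lookahead refuses to step onto either code, so the
# repeated part can never contain <startcode> or <endcode>.
tempered = re.compile(r"<startcode>(?:(?!<startcode>|<endcode>).)*<endcode>")
lazy = re.compile(r"<startcode>.*?<endcode>")

line = "<startcode> text <endcode> blah blah <startcode> more text <endcode>"
nested = "<startcode> outer <startcode> inner <endcode> tail <endcode>"

# Two separate coded portions on one line: both are found individually.
side_by_side = tempered.findall(line)

# Nested codes: the lookahead version finds only the innermost pair,
# while the non-greedy version starts at the outer <startcode>.
tempered_nested = tempered.findall(nested)
lazy_nested = lazy.findall(nested)

print(side_by_side)
print(tempered_nested)
print(lazy_nested)
```

On the nested sample the two patterns disagree, which is the practical reason to prefer the lookahead form when a start code may appear inside a coded portion.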

For functionality to be added to a regular expression recogniser it isn't sufficient that the proposed functionality is convenient. It has to fit into the underlying regular expression concept in such a way that its essential efficiency is maintained.

gdutoit · Post by **gdutoit** » Thu May 10, 2007 12:47 pm

Thanks again.

I haven't used WildEdit (it seemed to me that a utility like BK ReplacEm, which is free and can do long lists of replacements on multiple files, makes more sense).

But it seems I should give WildEdit a try. Will download the trial version immediately!

gdutoit · Post by **gdutoit** » Thu May 10, 2007 1:05 pm

Well I gave WildEdit a try, and it's all there in the Help!

Suppose I should've gone there first, and saved you some trouble. (But, in mitigation, I didn't expect WildEdit's functionality to be different from TextPad.)

ben_josephs · Post by **ben_josephs** » Thu May 10, 2007 1:23 pm

This, posted here in various forms a number of times, may be of interest:

There are many regular expression tutorials on the web, and you will find recommendations for some of them if you search this forum.

A standard reference for regular expressions is

Friedl, Jeffrey E F
Mastering Regular Expressions, 3rd ed
O'Reilly, 2006
ISBN 10: 0-596-52812-4
http://regex.info/

But be aware that the regular expression recogniser used by TextPad is rather weak by the standards of recent tools, so you may get frustrated if you discover a handy trick that doesn't work in TextPad. The recogniser that WildEdit (http://www.textpad.com/products/wildedit/) uses (Boost) is far more powerful.

Edit: updated to 3rd edition.

Post by **MudGuard** » Thu May 10, 2007 1:56 pm

There is a third ed from last August (which I haven't got, so I can't say whether it is better than 2nd ...)

ben_josephs · Post by **ben_josephs** » Thu May 10, 2007 2:40 pm

So there is! Thanks. Earlier posting updated.