Multi-Line Regex

General questions about using WildEdit


BenjiSmith
Posts: 49
Joined: Fri Jan 16, 2004 9:37 pm

Post by BenjiSmith »

It looks like the multi-line regular expressions are working correctly.

So, with an expression like this (which can be used to remove duplicate lines from a non-sorted file):

Code:

Search Expression:       \n(.+)\n(.*\n)?\1
Replacement Expression:       \n$1\n$2\n
It works pretty well. (Or, at least, it works with very small files. In a test file that was just under 10K, I got back an error reporting "Memory exhausted").
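
For comparison, here is a minimal sketch of the same duplicate-removal idea in Python. It assumes Python's `re` module rather than WildEdit's own engine, so the replacement is written with `\1` instead of `$1`, and `remove_duplicate_lines` is a made-up name for illustration. Two things differ from the expression above: `[^\n]` is used instead of `.` so the result does not depend on any dot-matches-newline option, and the repeated text must end at a line boundary, so a line such as `app` is not treated as a duplicate of `apple`.

```python
import re

# Group 1: a non-empty line. Group 2: any lines in between (lazy, so the
# nearest later copy is found first). \1 must then be followed by a newline
# or the end of the text, which makes it a whole-line match.
DUPLICATE_LINE = re.compile(
    r'^([^\n]+)\n((?:[^\n]*\n)*?)\1(?:\n|\Z)',
    re.MULTILINE,
)

def remove_duplicate_lines(text: str) -> str:
    """Drop later copies of any line, keeping the first occurrence."""
    while True:
        # One pass cannot catch overlapping cases, so repeat until stable.
        text, count = DUPLICATE_LINE.subn(r'\1\n\2', text)
        if count == 0:
            return text

if __name__ == "__main__":
    print(remove_duplicate_lines("a\nb\na\nc\nb\n"))
```

Each substitution makes the text strictly shorter, so the loop always terminates. Blank lines are left alone because group 1 requires at least one character.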
bbadmin
Site Admin
Posts: 809
Joined: Mon Feb 17, 2003 8:54 pm

Post by bbadmin »

I've not been able to reproduce this, so it may be data dependent. However, by default ".+" matches to the end of the file, so do you have the option "'.' does not match a newline character" checked?

Keith MacDonald
Helios Software Solutions
mo
Posts: 306
Joined: Tue Mar 11, 2003 1:40 am

Post by mo »

Default setup, regular expression, iso-8859-1, search subfolders

When I put the below (as is) into the test box it finds and replaces in the test box no problem (whether "regular expression" is checked or not). When I try it on actual files it comes up no changes (where at least one should be valid as the search-for was copied directly from it).

I have tried escaping the "."s and putting in "\n"s at the ends of newlines. When I do that it does not find the text in the test box or find it in actual files.

I am sure this is something simple, but would appreciate being told what it is.

A help topic on multi-line find and replaces would be helpful...especially confusing is the \n thing and when \r\n is needed and not.

Search For:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta name="description" content="Presents words.">
<meta name="keywords" content="word, word">
<link rel="shortcut icon" href="../favicon.ico">
<link rel="copyright" href="../copyrightstatement.htm">
<link rel="stylesheet" type="text/css" href="../styles/common.css" />



Replace with:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Style-Type" content="text/css">
<link rel="stylesheet" type="text/css" href="../../../admin/styles/common.css" />
Best Wishes!
Mike Olds
ben_josephs
Posts: 2457
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

bbadmin wrote:I've not been able to reproduce this, so it may be data dependent. However, by default ".+" matches to the end of the file, so do you have the option "'.' does not match a newline character" checked?
Make sure that option is not checked and try:

Code:

000000000
000000001
000000002
...
000000999
(I know an editor that can produce that quite quickly...)

A subexpression such as ".+" does not necessarily match to the end of the text. If the recogniser reaches the end without finding a match, it will backtrack and try again with a shorter match (if there is one) for such subexpressions. And again, and again, and again... This can be extremely expensive in both time and space. Hence the "Memory exhausted" message. It is a warning that constructing regexes with pathological behaviour is rather easy!

So either the option you mention should be selected or occurrences of ".+" and ".*" in the regex should be replaced by "[^\n]+" and "[^\n]*".
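
The difference between the two forms can be seen in a small Python sketch. It assumes Python's `re` module, where the `re.DOTALL` flag plays the role of leaving "'.' does not match a newline character" unchecked; WildEdit's engine may differ in detail. The helper names `sample_text` and `first_match` are hypothetical.

```python
import re

def sample_text(count: int = 1000) -> str:
    """Build numbered test lines: 000000000, 000000001, ..."""
    return "\n".join(f"{i:09d}" for i in range(count)) + "\n"

def first_match(pattern: str, text: str, dot_matches_newline: bool) -> str:
    """Return the first match of pattern in text, or '' if there is none."""
    flags = re.DOTALL if dot_matches_newline else 0
    match = re.search(pattern, text, flags)
    return match.group(0) if match else ""

if __name__ == "__main__":
    text = sample_text()
    # Dot matches newline: greedy ".+" first grabs everything to the end,
    # and a longer regex would then have to backtrack from there.
    print(len(first_match(r".+", text, True)), "characters matched")
    # Dot does not match newline: ".+" stops at the end of the line.
    print(repr(first_match(r".+", text, False)))
    # "[^\n]+" stops at the end of the line whatever the option says.
    print(repr(first_match(r"[^\n]+", text, True)))
```

The sketch only shows how far each subexpression reaches on its first attempt. It deliberately does not run the full duplicate-line expression with dot matching newlines, since that is the case reported to exhaust memory.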
s_reynisson
Posts: 940
Joined: Tue May 06, 2003 1:59 pm

Post by s_reynisson »

Just to confirm your findings mo. The moment I use just the first line it works fine so it must be down to some irregularities regarding the newline char. I'm using TP4.7.2 to save the file as PC-ANSI and WE1.0 with encoding windows-1252.
(The regex <!DOCTYPE HTML PUBLIC.+?common.css" /> works, if you can use that)
Then I open up and see
the person fumbling here is me
a different way to be
ben_josephs
Posts: 2457
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

s_reynisson wrote:Just to confirm your findings mo... it must be down to some irregularities regarding the newline char.
Yes. It seems that the problem arises if the file has CRLF or CR line endings, but not if it has LF line endings.

If you select "Regular expression" and explicitly put a \r (CR) at the end of each line, it works as expected.

This makes multi-line searches on non-unix-style files a little awkward...
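
One way to sidestep the line-ending problem outside WildEdit is to generate the expression from the pasted text, escaping each line and joining the lines with a sub-pattern that accepts CRLF, CR, or LF. The following Python sketch illustrates this; it assumes Python's `re` module, and `literal_to_pattern` is a hypothetical helper, not a WildEdit feature.

```python
import re

# Exactly one line break: CRLF, a lone CR, or a lone LF. The lookarounds stop
# a single CRLF from being counted as two breaks (CR, then LF).
EOL = r'(?:\r\n|\r(?!\n)|(?<!\r)\n)'

def literal_to_pattern(snippet: str) -> re.Pattern:
    """Compile pasted multi-line text into a line-ending-tolerant regex."""
    lines = re.split(r'\r\n|\r|\n', snippet)
    # re.escape makes '.', '/', '"' and friends match literally.
    return re.compile(EOL.join(re.escape(line) for line in lines))

if __name__ == "__main__":
    snippet = '<html lang="en">\n\n<head>'
    pattern = literal_to_pattern(snippet)
    for eol in ("\n", "\r\n", "\r"):
        document = snippet.replace("\n", eol) + eol
        print(repr(eol), bool(pattern.search(document)))
```

When trying this on real files, opening them with `open(path, newline='')` keeps Python from translating the line endings before the pattern sees them.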
mo
Posts: 306
Joined: Tue Mar 11, 2003 1:40 am

Post by mo »

thanks sr & bj!

The developers should give a look see at Advanced Find and Replace's ( http://www.abacre.net ) method of handling multi-line s&rs. They use some kind of very comprehensive (?regex) to make it possible to just copy and paste nearly any multi-line clip and do a s&r without editing.

They do this without limiting the ability to create custom search fors.

(Not pushing this program...it has way too many other problems to be really convenient to use and I have high hopes for WE because of its association with Helios).
Best Wishes!
Mike Olds