Multi-Line Regex

Posted: Thu May 27, 2004 11:35 pm
by BenjiSmith
It looks like the multi-line regular expressions are working correctly.

So, with an expression like this (which can be used to remove duplicate lines from a non-sorted file):

Search Expression:       \n(.+)\n(.*\n)?\1
Replacement Expression:       \n$1\n$2\n
It works pretty well. (Or, at least, it works with very small files. In a test file that was just under 10K, I got back an error reporting "Memory exhausted".)
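For comparison, the same job (dropping repeated lines from an unsorted file while keeping the first occurrence of each) can be done outside the editor without backreferences, which avoids the backtracking cost entirely. This is only a minimal Python sketch, not anything the program does internally; the function name `dedupe_lines` and the sample data are made up for illustration.

```python
def dedupe_lines(text: str) -> str:
    """Drop repeated lines, keeping the first occurrence of each, in order."""
    seen = set()
    kept = []
    # keepends=True preserves each line's own terminator (LF, CRLF or CR).
    for line in text.splitlines(keepends=True):
        key = line.rstrip("\r\n")  # compare the content only, not the line ending
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return "".join(kept)


sample = "pear\napple\npear\nfig\napple\n"
print(dedupe_lines(sample), end="")
```

Unlike the regex, this makes a single pass over the text, so its cost grows roughly linearly with file size.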

Posted: Fri May 28, 2004 10:55 am
by bbadmin
I've not been able to reproduce this, so it may be data dependent. However, by default ".+" matches to the end of the file, so do you have the option "'.' does not match a newline character" checked?

Keith MacDonald
Helios Software Solutions

Posted: Sat May 29, 2004 2:34 pm
by mo
Default setup, regular expression, iso-8859-1, search subfolders

When I put the text below (as is) into the test box, it finds and replaces there with no problem (whether "regular expression" is checked or not). When I try it on actual files, it reports no changes, even though at least one file should match, since the search text was copied directly from it.

I have tried escaping the "."s and putting in "\n"s at the ends of lines. When I do that, it finds the text neither in the test box nor in actual files.

I am sure this is something simple, but would appreciate being told what it is.

A help topic on multi-line find and replace would be helpful... Especially confusing is the \n thing, and when \r\n is needed and when it is not.

Search For:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta name="description" content="Presents words.">
<meta name="keywords" content="word, word">
<link rel="shortcut icon" href="../favicon.ico">
<link rel="copyright" href="../copyrightstatement.htm">
<link rel="stylesheet" type="text/css" href="../styles/common.css" />



Replace with:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Style-Type" content="text/css">
<link rel="stylesheet" type="text/css" href="../../../admin/styles/common.css" />

Posted: Sat May 29, 2004 7:51 pm
by ben_josephs
bbadmin wrote: I've not been able to reproduce this, so it may be data dependent. However, by default ".+" matches to the end of the file, so do you have the option "'.' does not match a newline character" checked?
Make sure that option is not checked and try:

000000000
000000001
000000002
...
000000999
(I know an editor that can produce that quite quickly...)

A subexpression such as ".+" does not necessarily match to the end of the text. If the recogniser reaches the end without finding a match, it will backtrack and try again with a shorter match (if there is one) for such subexpressions. And again, and again, and again... This can be extremely expensive in both time and space. Hence the "Memory exhausted" message. It is a warning that constructing regexes with pathological behaviour is rather easy!

So either the option you mention should be selected or occurrences of ".+" and ".*" in the regex should be replaced by "[^\n]+" and "[^\n]*".
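To make the difference concrete, here is a small Python sketch. It uses Python's `re` module as a stand-in, with `re.DOTALL` playing the role of "'.' matches a newline"; the editor's own regex engine may well behave differently in detail, and the helper `first_match` and the sample text are invented for the example.

```python
import re

text = "alpha\nbeta\nalpha\ngamma\n"


def first_match(pattern: str, dot_matches_newline: bool) -> str:
    """Return the first match of pattern in the sample text."""
    flags = re.DOTALL if dot_matches_newline else 0
    return re.search(pattern, text, flags).group(0)


# "." may match "\n": the greedy ".+" first swallows the whole text.
print(repr(first_match(r".+", True)))       # 'alpha\nbeta\nalpha\ngamma\n'
# "." may not match "\n": the match stops at the end of the line.
print(repr(first_match(r".+", False)))      # 'alpha'
# A negated class stays inside one line whatever the option says.
print(repr(first_match(r"[^\n]+", True)))   # 'alpha'
```

In the first case every later part of a larger regex can only succeed after the engine has given back characters one at a time, which is where the repeated backtracking comes from.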

Posted: Sat May 29, 2004 8:21 pm
by s_reynisson
Just to confirm your findings, mo. The moment I use just the first line it works fine, so it must be down to some irregularities regarding the newline char. I'm using TP4.7.2 to save the file as PC-ANSI and WE1.0 with encoding windows-1252.
(The regex <!DOCTYPE HTML PUBLIC.+?common.css" /> works, if you can use that.)

Posted: Sun May 30, 2004 9:56 am
by ben_josephs
s_reynisson wrote: Just to confirm your findings, mo... it must be down to some irregularities regarding the newline char.
Yes. It seems that the problem arises if the file has CRLF or CR line endings, but not if it has LF line endings.

If you select "Regular expression" and explicitly put a \r (CR) at the end of each line, it works as expected.

This makes multi-line searches on non-unix-style files a little awkward...
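As an illustration of the line-ending issue (in Python, not in the editor itself), a pasted multi-line clip can be turned into a pattern that accepts CRLF, CR or LF at every line break. This is a rough sketch under my own assumptions: `clip_to_pattern`, `replace_clip` and the sample strings are hypothetical names and data, and nothing here describes how the program handles line endings internally.

```python
import re

# One line break in any style. The lookahead stops a CRLF pair from being
# split into a lone CR plus a lone LF and counted as two line breaks.
ANY_NEWLINE = r"(?:\r\n|\r(?!\n)|\n)"


def clip_to_pattern(clip: str) -> re.Pattern:
    """Turn a pasted multi-line clip into a regex that ignores line-ending style."""
    lines = re.split(r"\r\n|\r|\n", clip)
    # re.escape makes "." and other metacharacters literal, so the clip
    # needs no manual escaping.
    return re.compile(ANY_NEWLINE.join(re.escape(line) for line in lines))


def replace_clip(text: str, clip: str, replacement: str) -> str:
    # Passing a function as the replacement keeps any backslashes literal.
    return clip_to_pattern(clip).sub(lambda m: replacement, text)


clip = '<html lang="en">\n\n<head>'                      # pasted with LF endings
crlf_file = 'intro\r\n<html lang="en">\r\n\r\n<head>\r\n<title>x</title>\r\n'

print(clip in crlf_file)                                 # False: plain text search misses
print(bool(clip_to_pattern(clip).search(crlf_file)))     # True
```

The same idea is what putting an explicit \r at the end of each line achieves by hand, except that it also covers LF-only and CR-only files.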

Posted: Sun May 30, 2004 1:19 pm
by mo
thanks sr & bj!

The developers should give a look see at Advanced Find and Replace's ( http://www.abacre.net ) method of handling multi-line s&rs. They use some kind of very comprehensive (?regex) to make it possible to just copy and paste nearly any multi-line clip and do a s&r without editing.

They do this without limiting the ability to create custom search fors.

(Not pushing this program...it has way too many other problems to be really convenient to use, and I have high hopes for WE because of its association with Helios.)