Regular Expression for multiple lines

alg · Post by **alg** » Tue Jan 06, 2004 4:28 pm

I am trying to write a regular expression that will replace or delete multiple lines that always start and end with a fixed sequence of characters. For example:

START OF DATA
.
.
.
END OF DATA

The problem I have is that . does not match line terminators. Also, I can't have a class [.\n]. Any suggestions?

I checked the forum without any luck. Thanks.
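
For comparison, here is a minimal sketch of the same behaviour in Python's `re` module. Python is an assumption made purely for illustration; TextPad uses a different regex engine and may not behave identically. The sketch shows that `.` stops at a line terminator by default, that `.` inside a character class is only a literal dot, and that an alternation such as `(.|\n)` or the `re.DOTALL` flag can span lines.

```python
import re

text = "before\nSTART OF DATA\nrow 1\nrow 2\nEND OF DATA\nafter\n"

# By default "." does not match "\n", so the pattern cannot span lines.
plain = re.search(r"START OF DATA.*END OF DATA", text)

# Inside a character class "." is a literal dot, so [.\n] only matches
# dots and newlines, not arbitrary characters.
char_class = re.search(r"START OF DATA[.\n]*END OF DATA", text)

# An alternation of "any character" or "newline" does span lines.
alternation = re.search(r"START OF DATA(.|\n)*END OF DATA", text)

# re.DOTALL makes "." match newlines as well.
dotall = re.search(r"START OF DATA.*END OF DATA", text, re.DOTALL)

print(plain is None, char_class is None)
print(alternation is not None, dotall is not None)
```

Whether an equivalent of `(.|\n)` or a dot-matches-newline option exists in a given TextPad version is not something this thread establishes.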

Bob Hansen · Post by **Bob Hansen** » Wed Jan 07, 2004 5:56 am

Have you tried changing the "\n" at the end of each line (except the last one, END OF DATA) to something unique like "Q3Qk"? That turns each section into one long string and sidesteps the problem of an unknown number of "\n" characters.

alg · Post by **alg** » Wed Jan 07, 2004 3:02 pm

OK, I eliminated the new lines. I am left with a regular expression like this:

START.+END

where "START" and "END" are the fixed sequence of characters and any characters can be in between. I tested this in another regular expression processor (Java) and it works - but not in TextPad.

Any more ideas?

Thanks.

Bekah · Post by **Bekah** » Thu Jan 08, 2004 2:08 pm

Hi Textpad people,
I think I want to do the same thing as alg.
How do I include a newline in a replacement expression?
Thanks,
Bekah

Bob Hansen · Post by **Bob Hansen** » Fri Jan 09, 2004 4:37 pm

If you have other lines in the document besides the START OF DATA and END OF DATA groups, then be sure to only replace "\n" on the selected text for those lines.

If you replaced all \n with a unique code then you should have one long string that now replaces those blocks. Now search for "START.*" and replace with nothing.

alg · Post by **alg** » Fri Jan 09, 2004 6:09 pm

Hi Bob,

I need to specify a start and an end, e.g. "START.*END" and delete everything in-between as well as the start and end. Unfortunately, the file is very large and the pattern occurs many times. I can always do this manually. But I am sure the problem will come up again and I am looking for a simple general solution.

In addition "START.*END" does not work even if the start and end are on a single line. It seems to me that the regular expression processor is deficient - but, admittedly, I am not an expert in such matters.

Thanks.

talleyrand · Post by **talleyrand** » Fri Jan 09, 2004 9:21 pm

How 'bout just a simple program to handle it?

Install Python and you're good to go.

Copy this code and paste it into TextPad.
Save it to something like C:\alg.py
Update the filename variables and your marker criteria. If you need a more complex search, say a regular expression, it wouldn't be hard to implement.

C:\>python alg.py

You can also run it straight out of TextPad. Set the command to wherever you installed Python (probably C:\python23\python.exe).

Code:

import sys

def processFile(fileName, outFileName, beginMarker, endMarker):
    """
    Process our file.
    """
    ignore = False
    out = open(outFileName, 'w')

    for currentLine in open(fileName, 'r').readlines():
        # if the markers are not in a consistent case, lowercase the
        # markers and test currentLine.lower() in the find() calls below
        # echo the line (minus its trailing newline) to show progress
        print(currentLine[:-1])

        if (currentLine.find(beginMarker) >= 0):
            ignore = True

        if (not ignore):
            #write line
            out.write(currentLine)

        if (currentLine.find(endMarker) >= 0):
            ignore = False

    #clean up
    out.close()

def main():
    fileName = r'c:\bfellows\alg_test.txt'
    outFileName = r'c:\bfellows\alg_out.txt'
    beginMarker = 'START OF DATA'
    endMarker = 'END OF DATA'
    processFile(fileName, outFileName, beginMarker, endMarker)

if __name__ == '__main__':
    main()
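
The post mentions that a regular-expression search "wouldn't be hard to implement". A minimal sketch of that variant follows, written for Python 3 and assuming the same line-by-line logic as the script above. The function name `remove_blocks` and the sample marker patterns are hypothetical, chosen only for illustration.

```python
import re

def remove_blocks(lines, begin_pattern, end_pattern):
    """Return the lines that fall outside begin/end marker blocks.

    begin_pattern and end_pattern are regular expressions; the marker
    lines themselves are dropped along with everything between them.
    """
    begin_re = re.compile(begin_pattern)
    end_re = re.compile(end_pattern)
    kept = []
    ignore = False
    for line in lines:
        if begin_re.search(line):
            ignore = True
        if not ignore:
            kept.append(line)
        # Checked after the append so the end-marker line is dropped too.
        if end_re.search(line):
            ignore = False
    return kept

sample = [
    "keep 1\n",
    "Start of Data\n",
    "drop me\n",
    "END OF DATA\n",
    "keep 2\n",
]
# (?i) makes the hypothetical marker patterns case-insensitive.
result = remove_blocks(sample, r"(?i)^start of data", r"(?i)^end of data")
print("".join(result), end="")
```

Working on a list of lines instead of file names keeps the sketch easy to test; reading and writing the files would wrap around this function.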

trids · Post by **trids** » Wed Jan 14, 2004 3:10 pm

I reported a problem with 4.6.2 and regexp to replace text across multiple lines: http://textpad.com/forum/viewtopic.php? ... highlight=

... could be related .. ?

alg · Post by **alg** » Wed Jan 14, 2004 3:22 pm

When someone suggested that I write a simple program, I came to the conclusion - perhaps inaccurately - that there isn't an interest in nailing down this problem.

My impression is that this is a problem with the regular expression processor. The regular expression "START.*END" does not work - period. This is the case whether START and END are on the same line or not. When I use the same regular expression in a Java program - it works!

Bob Hansen · Post by **Bob Hansen** » Wed Jan 14, 2004 4:26 pm

alg wrote: The regular expression "START.*END" does not work - period. This is the case whether START and END are on the same line or not

I cannot duplicate this problem.
===================================
Regex works fine for the following test lines

START some data, more data, to the END
START some data, more data, to the END
START some data, more data, to the END
START some data, more data, to the END
START some data, more data, to the END
START some data, more data, to the END
START some data, more data, to the END
START some other data, more other data, to the END
START somemore data, more different data, to the END
START some new data, more new data, to the END
START some old data, more old data, to the END
START some repeat data, more repeat data, to the END
START some unique data, more unique data, to the END

Searching for "START.*END" (without quotes) replaces each line with a blank line.
Searching for "START.*END\n" (without quotes) deletes each line completely.
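
The difference between the two searches can be reproduced outside TextPad. The sketch below uses Python's `re` module as a stand-in, which is an assumption; it is not TextPad's engine, but it treats `.` and `\n` the same way in this case.

```python
import re

text = (
    "START some data, more data, to the END\n"
    "START some other data, more other data, to the END\n"
)

# Without "\n" in the pattern the line break survives, leaving blank lines.
blanked = re.sub(r"START.*END", "", text)

# Including "\n" removes the line break too, so the lines disappear.
deleted = re.sub(r"START.*END\n", "", text)

print(repr(blanked))
print(repr(deleted))
```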

Selection of POSIX does not matter, same results in both instances.
===============================
Steps taken to make this happen:
From the Main Menu, choose Search, then Replace, and enter the value in the "Find What" field of the Replace window. Make sure the "Replace With" field is blank. In the Conditions box, select "text" and "Regular expression"; under Scope, select Active document. Click on Replace All.
Replacements happen as noted above.

Replacing "\n" with a unique value "~" also works on these lines. Using the Scope of Selected Text vs. Active document:
Select text, (lines 4-10)
Replace \n with ~ (combines to START.........END).
Select text (the block just modified START.....END)
Replace START.*END with nothing (all of line is replaced with blank line).
=====================================
If any of this is not working for you.....
What are your values, what steps are you taking? What are your results?

"does not work - period" is a bit vague. No error message? No cursor movement? No focus change? Specifics are helpful here.
========================

"that there isn't an interest in nailing down this problem."

If that were the case, there would have been no responses to your request. There must be some interest, the item has been looked at over 125 times.

talleyrand · Post by **talleyrand** » Wed Jan 14, 2004 4:37 pm

alg wrote: When someone suggested that I write a simple program, I came to the conclusion - perhaps inaccurately - that there isn't an interest in nailing down this problem.

I didn't suggest you write a program; I suggested you run the program I just wrote. It appeared to be working fine for eliminating stuff. Bob already wrote a response, but I would second his findings: the regular expression appears to be working for me as well.

s_reynisson · Post by **s_reynisson** » Wed Jan 14, 2004 4:42 pm

Just to confirm B&T's notes, I can not duplicate this problem.
Find and Replace with and without \n works fine.
What version of TP are you using? 4.7.2? HTH

alg · Post by **alg** » Wed Jan 14, 2004 5:23 pm

My apologies to everyone. My foot is in my mouth. I don't know what I was doing wrong. I thought I tried the regular expression several times and it didn't work for me. But when I followed the posted instructions, it worked!

walidaly · Post by **walidaly** » Sat Jan 17, 2004 5:03 pm

hello there(1st post)
I can't get this to work, you mean it should look like
replace
START OF DATA .*\n.*[^END]
with
START OF DATA

Then it replaces one line and waits?

Why don't you just add a new expression, like #, for all characters including newline?!

so that I can edit something like

Code:

<hello>
I am trying to write a regular expression that will replace or delete multiple lines that always start and end with a fixed sequence of characters. For example: 

START OF DATA 
. 
. 
. 
END OF DATA 

The problem I have is that . does not match line terminators. Also, I can't have a class [.\n]. Any suggestions? 

I check the forum without any luck. Thanks.
<bye>

so I replace
<hello>#*<bye>
with
<hello><bye>

or let it even accept \n*
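
A note on the `[^END]` in the pattern above: in most regex dialects that is a negated character class, not "anything except the word END". The sketch below illustrates this with Python's `re` module, which is an assumption for illustration and not TextPad's engine. It also shows one way the `<hello>` ... `<bye>` replacement could be written where the engine lets a pattern consume newlines.

```python
import re

# [^END] is a negated character class: it matches ONE character that is
# not "E", "N", or "D". It does not mean "anything except the word END".
singles = re.findall(r"[^END]", "DENIED")
print(singles)

text = "<hello>\nline one\nline two\n<bye>\n"

# Spanning lines needs a construct that can consume "\n"; here (.|\n)*?
# lazily matches everything up to the first <bye>.
collapsed = re.sub(r"<hello>(.|\n)*?<bye>", "<hello><bye>", text)
print(repr(collapsed))
```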

alg · Post by **alg** » Sat Jan 17, 2004 5:59 pm

The trick to deleting or replacing a sequence of characters that extends over multiple lines was given by Bob Hansen. Namely, first you have to replace all new lines with a unique code. Then you can use a regular expression like START.*END to find and replace the sequence that starts with START and ends with END. The final step would be to replace any remaining unique codes with new lines.

I hope this helps.
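
The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of TextPad itself: the placeholder "Q3Qk" is borrowed from Bob Hansen's earlier post, the sample text is invented, and the lazy quantifier `.*?` is a feature of Python's `re` module that the thread does not confirm for TextPad.

```python
import re

PLACEHOLDER = "Q3Qk"  # any token that never occurs in the real text

text = (
    "keep 1\n"
    "START OF DATA\nrow 1\nrow 2\nEND OF DATA\n"
    "keep 2\n"
    "START OF DATA\nrow 3\nEND OF DATA\n"
    "keep 3\n"
)

# Step 1: replace every newline with the placeholder (one long line).
joined = text.replace("\n", PLACEHOLDER)

# Step 2: delete each block plus its trailing placeholder. The lazy ".*?"
# stops at the first END OF DATA; a greedy ".*" would swallow everything
# from the first START to the last END, including "keep 2".
pattern = r"START OF DATA.*?END OF DATA" + re.escape(PLACEHOLDER)
stripped = re.sub(pattern, "", joined)

# Step 3: turn the remaining placeholders back into newlines.
restored = stripped.replace(PLACEHOLDER, "\n")
print(restored, end="")
```

With several blocks in one file, a greedy START.*END on the joined text would remove the lines between blocks as well, so with an engine that lacks a lazy quantifier it is safer to join and replace one selected block at a time, as described earlier in the thread.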