Page 1 of 1

Having trouble with replace of repetitive lines

Posted: Wed Jan 28, 2009 3:10 pm
by dlherrin
I am trying to process tab-delimited files output from Excel to prepare them for input into another program. The files have a prelude section, a repeated data section, and a postlude section. I want to get rid of the prelude/postlude entirely and transform each line of the repeated section.

My input file looks like this:
-------------------------------
Princeton University
Function Revenue by Category
Sorted by Function Date
Filtered by Site = TEST
"Tuesday, November 4, 2008"


Event # Event Name Account # Customer Name Function Date Function Type Category Charges

TEST200 Astronomy 9991111 Jane Doe 11/4/2008 Dinner Buffet Food "1,767.00"
TEST200 Astronomy 9991111 Jane Doe 11/4/2008 Dinner Buffet S & S 420.00
TEST200 Astronomy 9991111 Jane Doe 11/4/2008 Dinner Buffet Svc Chg 265.05

Total: "2,452.05"



"Printed: November 4, 2008 - 11:50 AM" Page 1
-------------------------------

I have tried the following seach and replace strings, but I can't seem to get this to work.

Search
Princeton University.*Charges$.*?\t{7}\r\n(?:(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t"?(.*?)"?$)*?\t{7}\r\n.*

Replace
"D"\t$1\t$2\t$3\t$4\t$5\t$6\t$7\t$8

Options:
* Regular expression
* Replacement format

(Above and below the repeating data lines are lines consisting solely of 7 tabs.) I would appreciate any suggestions you have to offer.

Thanks!
Dave

Posted: Thu Jan 29, 2009 3:53 pm
by ben_josephs
The way you use $ indicates that you are not using the option '.' does not match a newline character (so that '.' does match a newline), while your use of \r\n indicates that you are using that option (so that '.' does not match a newline). You must be consistent.

If you do not use that option, this will match the entire text:
Princeton University.*Charges$.*?\t{7}\r\n(((.*?)\t){7}"?(.*?)"?$)*?\r\n\t{7}\r\n.*
(Note the extra \r\n near the end.)

To match the lines you want to keep, you can use
Princeton University.*Charges$.*?\t{7}\r\n((((.*?)\t){7}"?(.*?)"?$)*?)\r\n\t{7}\r\n.*
and use $1 in the replacement.

I don't believe you can trim the top and bottom and get rid of the quote symbols at the same time.