
Finding DEADSPACE that could be tabs, spaces, even carrier r

Posted: Wed Jun 11, 2003 7:17 pm
by no.cache
Hi friends! Okay, it's time for me to return to CLEAN-UP DUTY and I need help with (what else heh) Mr. Regular Expression. I'm producing a mailing list, and I need to get rid of multiple lines of extraneous garbage. For this example I'll use the boundary words PURPLE and GRAPE.

"GRAPE" represents the first instance of a string of (something) that I want to keep and will always be preceded by the appearance of a COLON. I'm stuck as to what falls between that COLON and GRAPE because the deadspace rendered from my OCR manifests alternatively as space(s), tab(s), or (in rare cases) carrier return(s), eg.
~~~~~~:=deadspace=GRAPE~~~~~~~
or
~~~~~~:=carrierreturn(s)=
GRAPE~~~~~~~
or even
~~~~~~:=deadspace(s)carrierreturn(s)=
GRAPE~~~~~~~

I've (crudely) gotten as far as matching the literal instance of PURPLE up to =onespace=GRAPE, eg.
PURPLE.*\n.*\n.*\n.*\n.*: GRAPE

but since I can't reliably know how that =deadspace= between the colon and GRAPE will express itself, I'm stuck on what wildcard string I can use to locate it. It's complicated by the fact that there _could_ be carrier return(s) somewhere after the colon.

Thanks for any help you can provide!

Skye
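
For comparison outside TextPad, here is a minimal Python sketch of the problem described above. It assumes an engine where `\s` covers spaces, tabs, and line breaks alike, which the thread suggests TextPad's engine of the time did not handle well. The sample strings and the function names `normalize_gap` and `drop_block` are made up for illustration, and `drop_block` assumes the goal is to delete everything from PURPLE through the colon.

```python
import re

# \s* covers spaces, tabs, \r and \n in any order, zero or more times.
# The lookahead keeps GRAPE itself out of the match.
GAP = re.compile(r":\s*(?=GRAPE)")

# DOTALL lets "." cross line breaks, so no chain of .*\n.*\n is needed.
BLOCK = re.compile(r"PURPLE.*?:\s*(?=GRAPE)", re.DOTALL)

def normalize_gap(text: str) -> str:
    """Collapse whatever sits between the colon and GRAPE into one space."""
    return GAP.sub(": ", text)

def drop_block(text: str) -> str:
    """Delete everything from PURPLE through the colon and the deadspace."""
    return BLOCK.sub("", text)

# Hypothetical OCR output: the gap is a space, a tab plus a line break,
# or spaces plus several line breaks.
samples = [
    "PURPLE junk: GRAPE rest",
    "PURPLE junk:\t\r\nGRAPE rest",
    "PURPLE junk:  \n\nGRAPE rest",
]
for s in samples:
    print(repr(normalize_gap(s)))

print(repr(drop_block("keep\nPURPLE a\nb\nc: \n GRAPE tail")))
```

The lazy `.*?` stops at the first colon that is followed by optional whitespace and GRAPE, so a colon earlier in the garbage block does not end the match too soon.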

Posted: Wed Jun 11, 2003 11:04 pm
by Bob Hansen
I have a question before looking at this... I am not sure that the TP version of Regex supports matches that wrap around return codes.

Is it possible for you to do a temporary search and replace of the return codes? Maybe change them to ~ or | or some other character that would be unique?

After doing the Regex, we could go back and do another search and replace to turn the ~ or | back into return codes.
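
The swap-and-restore idea can be sketched in Python as below. This is only an illustration of the workflow, not TextPad behaviour: the helper name `with_placeholder`, the choice of `~` as the stand-in, and the sample text are all assumptions.

```python
import re

PLACEHOLDER = "~"  # assumed not to occur in the real text; the helper checks

def with_placeholder(text: str, pattern: str, repl: str) -> str:
    """Run a single-line regex over text whose line breaks are temporarily
    swapped for PLACEHOLDER, then restore the line breaks."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the text")
    # Step 1: flatten. Windows line endings are normalized to \n first.
    flat = text.replace("\r\n", "\n").replace("\n", PLACEHOLDER)
    # Step 2: the pattern can now treat the placeholder as ordinary deadspace.
    flat = re.sub(pattern, repl, flat)
    # Step 3: restore the line breaks that survived the replacement.
    return flat.replace(PLACEHOLDER, "\n")

sample = "junk:  \nGRAPE jam\nnext line"
print(repr(with_placeholder(sample, r":[ \t~]*GRAPE", ": GRAPE")))
```

Any placeholder consumed by the pattern is gone for good, which is the point: the line break between the colon and GRAPE disappears while the others come back in step 3.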

Posted: Wed Jun 11, 2003 11:53 pm
by no.cache
Hi Bob,

As you can see from my GRAPE example, I've already resigned myself to S/R'ing multiple \n's . . . one more won't kill me.

Would I be looking for ~~~:=deadspaceortabs=\n
=morepossibledeadspaceortabs=GRAPE

?

I just hoped there was a more efficient S/R that could grab that (possible) carrier return. I can't be too picky about this however because I know those \n's are prickly heh.

Lead on.

Skye Girl

Posted: Thu Jun 12, 2003 4:46 pm
by jeffy
Carrige (how the heck do you spell that?!) returns are not well supported in TextPad regular expressions. Replacing \n before running an RE is the best option I know of.

Posted: Thu Jun 12, 2003 4:50 pm
by no.cache
Jeffy, okay, could you help me _just_ locate the blank space before a \n? If it is either a tab(s) or space(s)?

This is just driving me nuts. I've tried [:blank:]*\n and can't get it to reliably locate _just_ the dead space. Arrrrrrghhhh!!!!!

:oops:

Skye

Posted: Thu Jun 12, 2003 4:57 pm
by jeffy
Try

[a space]$

$ means "the end of a line" and is more reliable than searching for \n.

Hope this is what you're looking for.

Posted: Thu Jun 12, 2003 5:01 pm
by jeffy
Also, I'm thinking you might want

[ ]+$

Where there could be more than one space before the end of the line.

Going further...

[ a-z]+$ would find a run of one or more spaces and/or lowercase letters sitting right before the end of the line.

Replace "+" with "*" if you need zero or more.
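
The difference between these patterns can be checked with a small Python sketch. Python's `re` module is used here only as a stand-in, so details may differ from TextPad's engine; the helper name `tail` and the sample lines are made up.

```python
import re
from typing import Optional

def tail(pattern: str, line: str) -> Optional[str]:
    """Return the text the pattern matches at the end of the line,
    or None when there is no match at all."""
    m = re.search(pattern, line)
    return m.group(0) if m else None

print(repr(tail(r"[ ]+$", "abc   ")))       # run of trailing spaces
print(repr(tail(r"[ ]+$", "abc")))          # no trailing space: no match
print(repr(tail(r"[ a-z]+$", "ABC xyz")))   # spaces and lowercase letters
print(repr(tail(r"[ a-z]+$", "ABC")))       # "+" needs at least one character
print(repr(tail(r"[ a-z]*$", "ABC")))       # "*" also accepts an empty match
```

With `*` the pattern succeeds on every line, matching an empty string when nothing qualifies, so `+` is the safer choice when the goal is to find lines that really do end in deadspace.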

Posted: Thu Jun 12, 2003 5:03 pm
by jeffy
Check out my TextPad Regular Expression FAQ if you need more information:

http://www.jeffyjeffy.com/code/textpad/documentation/regular_expression_faq.html

Posted: Thu Jun 12, 2003 5:04 pm
by jeffy
Hey, I just rapid fire created three posts, so what's one more...?

:' )

Posted: Thu Jun 12, 2003 5:07 pm
by no.cache
Jeff, Bob . . . got it!

[ \t]+$

Now I don't know why when I first tried this it didn't work (doubtless I didn't have the expression correct) but I just tried it again — this time testing it by inserting various combinations of tabs and spaces — and it works great.

:D

Also guys, I know TP would perform these S/R's more efficiently if I were to remove those carrier returns . . . but you cannot imagine (yes you can heh) how much more difficult it makes my first edit of these crummy OCR'd files not to at least have some crude semblance of their graphic shape.

I gotta stitch as the last step or I'll lose my mind. :wink:

On to the next headache.

Skye-a-Watha
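
As a rough equivalent of the `[ \t]+$` search with an empty replacement, here is a minimal Python sketch that strips trailing tabs and spaces from every line while leaving the line breaks, and so the shape of the page, alone. The function name `strip_trailing` and the sample text are assumptions for illustration.

```python
import re

# MULTILINE makes "$" match at the end of every line, not just the last one.
TRAILING = re.compile(r"[ \t]+$", re.MULTILINE)

def strip_trailing(text: str) -> str:
    """Remove tabs and spaces at the end of each line; keep the line breaks."""
    return TRAILING.sub("", text)

page = "name: \t\n  indented line  \nlast\t\t"
print(repr(strip_trailing(page)))
```

Leading indentation is untouched because the character class has to run all the way to the end of the line to match. Text with `\r\n` line endings would need the `\r` handled separately, since a `\r` sits between the deadspace and the point where `$` matches.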

Posted: Thu Jun 12, 2003 5:09 pm
by no.cache
jeffy wrote:
> Check out my TextPad Regular Expression FAQ if you need more information:
>
> http://www.jeffyjeffy.com/code/textpad/documentation/regular_expression_faq.html

The World Famous JeffyJeff FAQ! :D I had completely forgotten about this fantastic page. Got it bookmarked now Jeff.

Hugs,
Skye King

Posted: Thu Jun 12, 2003 5:16 pm
by jeffy
You just single-handedly made my month, Skye.

:' )

Posted: Thu Jun 12, 2003 5:19 pm
by no.cache
Cool! :D

But trust me, the worst is yet to come. As Arnold said:
I'll be back.

groan :wink:

Text manipulation

Posted: Tue Jun 17, 2003 12:02 pm
by Nial
> But trust me, the worst is yet to come. As Arnold said:
> I'll be back.

Skye,

Looking at all your posts asking for help about regexps, I'd say
you'd probably be better off having a look at Perl. It's designed
to do exactly the sort of things you're trying to do, and isn't
hard to pick up (if you take things one step at a time).

It's also got better regexp handling than textpad.

See 'Links' on my web site for some Perl books, a web based
tutorial and details on where to download Perl free.

http://www.nialstewartdevelopments.co.uk

Nial.

Posted: Tue Jun 17, 2003 12:51 pm
by no.cache
I'll definitely look into that Nial. Right now I'm jammed by a deadline but I'll come over and visit once I'm done with this first OCR project, because there will be dozens to follow.

Skye-a-Watha