DUPLICATES

General questions about using TextPad

Samuel

DUPLICATES

Post by Samuel »

Hello,
I have a dilemma: I have a list of about 15,000 names. I would like to run a macro that will tab over any word that is a duplicate. Is this more difficult than it sounds?
EXAMPLE

JOHNSON
SMITH
SMITH
HENRY
WILLIAMS
GOERGE
TONY
TOM
TOM
....

THANKS FOR ALL THE HELP!

SAM
Jeff Epstein

RE: DUPLICATES

Post by Jeff Epstein »

I assume you want this to end up like the following:

JOHNSON
SMITH
[...tab...]SMITH
HENRY
WILLIAMS
GOERGE
TONY
TOM
[...tab...]TOM
....

Please clarify if this is not correct. Anyway, the following Replacement Expression would be the solution you need:

Find What: \(.*\)\n\1
Replace With: \1\n\t\1
Regular Expression: Checked

But the "find what" regular expression is generating an error: "Invalid regular expression". Not sure why. The following regular expressions *are* valid:

\(.*\)\n
\(.*\)\1

There is nothing in the help documentation, as far as I can see, that explains why this particular regular expression is bad. Ideas, anyone?
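
If the Replace dialog keeps rejecting the expression, the same tabbing can be done outside TextPad. Here is a minimal sketch in Python rather than a TextPad macro. It assumes "duplicate" means a line identical to the line directly above it, as in the example; the function name is made up for illustration.

```python
def tab_adjacent_duplicates(lines):
    """Return a copy of lines with a tab in front of every line
    that repeats the line directly above it."""
    result = []
    previous = None
    for line in lines:
        if line == previous:
            # Same text as the line above: tab it over.
            result.append("\t" + line)
        else:
            result.append(line)
        # Compare against the original text, not the tabbed copy,
        # so a third or fourth repeat is tabbed as well.
        previous = line
    return result


names = ["JOHNSON", "SMITH", "SMITH", "HENRY", "TOM", "TOM"]
tabbed = tab_adjacent_duplicates(names)
```

Comparing line by line also handles three or more repeats in a row, which a single pass of a two-line find/replace would probably miss.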
Roy Beatty

RE: DUPLICATES

Post by Roy Beatty »

a) Sam: You can remove duplicates by sorting the file and opting for "Delete Duplicate Lines".

b) Jeff: TextPad's regex, it seems, is not good at (perhaps not built for) multi-line matching and replacing. Even so, you've found a bug either in the regex engine or the product documentation.
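
For option (a), the effect of sorting and then deleting duplicate lines can be approximated outside TextPad. This is a rough Python sketch, not TextPad's own implementation: the function name is hypothetical, and the comparison is exact and case-sensitive, which may differ from TextPad's sort options.

```python
def sort_and_dedupe(lines):
    """Sort the lines and keep one copy of each distinct line."""
    # set() drops repeated lines; sorted() puts what is left in order.
    return sorted(set(lines))


unique_names = sort_and_dedupe(["SMITH", "JOHNSON", "SMITH", "TOM", "TOM"])
```

Note that this removes the duplicates rather than marking them, so it only helps if the original order and the repeats themselves are not needed.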

Hope this helps,

Roy Beatty
Roy Beatty

RE: DUPLICATES

Post by Roy Beatty »

Ok. Here's another way, what we call a "kludge",

First replace (regex): "\n" with "#" (or something similar not found in your data)
Also replace: "^" with "#"
To find duplicates, find using: "#\(.*\)#\1"

After you're done, replace "^#" with "", and then replace "#" with "\n".

Of course, you are "streaming" your discrete records, and you'll want to wrap your text -- while you're matching the dupes. This will affect the appearance of your text somewhat ...
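
To see the flattening trick in action without touching the real file, here is a sketch in Python. It assumes "#" never occurs in the data. The pattern is slightly tighter than the one above: "[^#]*" and the trailing lookahead are my additions, so that TOM followed by TOMMY is not reported as a duplicate. All function names are made up for illustration.

```python
import re

SEP = "#"  # assumed not to occur anywhere in the data


def flatten(text):
    """Join all lines into one, with SEP before every record."""
    return SEP + text.replace("\n", SEP)


def unflatten(flat):
    """Undo flatten(): drop the leading SEP, restore the newlines."""
    return flat[len(SEP):].replace(SEP, "\n")


def adjacent_duplicates(text):
    """List each record that is immediately followed by an identical record."""
    flat = flatten(text)
    # "#(record)#" followed by the same record, which must end at "#" or at
    # the end of the text, so a longer record sharing a prefix is not matched.
    pattern = re.compile(r"#([^#]*)#\1(?=#|$)")
    return [m.group(1) for m in pattern.finditer(flat)]


sample = "JOHNSON\nSMITH\nSMITH\nHENRY\nTOM\nTOMMY\nTOM\nTOM"
dupes = adjacent_duplicates(sample)
```

Because matches do not overlap, a record repeated three times in a row is reported once here, not twice.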


If this will not do, then you might develop your own tool to make a copy of your original file, convert it to HTML, highlight the duplicate strings using HTML tags, and view the result in a web browser.

Consider developing the tool in Perl. That great book, _Mastering Regular Expressions_ by Jeffrey Friedl, pp. 233-236, discusses using modifiers (/m and /s) that affect how "^", "$", and "." treat "\n". The modifiers appear *after* the regex string and are not interpreted as part of the regex itself.

Remember that TextPad hews to the POSIX regex standards. Those modifiers are *not* POSIX compliant -- they are Perl syntax.
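
For anyone who wants to see what those two modifiers change, here is a small sketch. It uses Python rather than Perl; Python's re.MULTILINE and re.DOTALL flags correspond to Perl's /m and /s. The sample text and variable names are just for illustration.

```python
import re

text = "SMITH\nTOM"

# Default: "^" and "$" anchor only at the start and end of the whole
# string ("$" also just before a final newline), so nothing matches here.
whole_only = re.findall(r"^\w+$", text)

# re.MULTILINE (Perl /m): "^" and "$" also match at every line break,
# so each line is matched on its own.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

# Default: "." does not match "\n", so the pattern cannot span two lines.
dot_default = re.search(r"SMITH.TOM", text)

# re.DOTALL (Perl /s): "." matches "\n" as well.
dot_all = re.search(r"SMITH.TOM", text, re.DOTALL)
```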

I hope this helps,

Roy
Jeff Epstein

RE: DUPLICATES

Post by Jeff Epstein »

-------Roy Beatty:
> Ok. Here's another way, what we call a "kludge",

Actually, it's "klooj" :' )
Roy Beatty

RE: DUPLICATES

Post by Roy Beatty »

Ah, well, that was how we spelled it in ye olde pre-object oriented epoch.

Roy
Jeff Epstein

RE: DUPLICATES

Post by Jeff Epstein »

Below is the text from an earlier post in this thread. Does anyone know if this truly is a bug or just something missing from the documentation?

---------------------------------------------------------------
---------------------------------------------------------------
...the following Replacement Expression would be the solution you need:

Find What: \(.*\)\n\1
Replace With: \1\n\t\1
Regular Expression: Checked

But the "find what" regular expression is generating an error: "Invalid regular expression". Not sure why. The following regular expressions *are* valid:

\(.*\)\n
\(.*\)\1

There is nothing in the help documentation, as far as I can see, that explains why this particular regular expression is bad. Ideas, anyone?
---------------------------------------------------------------
---------------------------------------------------------------
Jeff Epstein

RE: DUPLICATES

Post by Jeff Epstein »

It appears there is another thread related to this.

http://www.textpad.com/forum/read.php?f=1&i=195&t=193

As Roy Beatty noted, TextPad's Regular Expressions are not good for multi-line finding/replacing. The documentation is not clear on this.
Hargobind Khalsa

RE: DUPLICATES

Post by Hargobind Khalsa »

On a somewhat related topic, I'm having problems figuring out any way to search and replace characters over multiple lines. The following is my example:


From: <unknownuser@nowhere.com>
To: <applications@nowhere.com>
Date: Sun, 20 May 2001 15:48:01 -0400
Received: from mail by SMTP-Server;
Sun, 20 May 2001 15:48:03 -0400
Subject: E-mail Undergraduate Application
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
X-OriginalArrivalTime: 20 May 2001 19:48:03.0246 (UTC)

EnteringTerm = Fall
EnteringTermFallYear = 01
EnteringTermSpringYear =
...


This is an email application that has been sent to me, and what I want to do is remove EVERYTHING above the first line of data (EnteringTerm...). So, I hoped there might be some easy way to just write:
Regular Expression: From[[:cntrl:][:print:]]*X-OriginalArrivalTime.*\n
Replace: <nothing... I want to delete everything that was found>
I *assumed* that [:cntrl:] would include newline characters, but apparently it doesn't. Also, you can't put "\n" inside a bracket expression (e.g. [\na-zA-Z0-9]*).
So, am I just SOL? Or is there actually a way to do this?
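
For comparison, this is roughly what the same deletion looks like in an engine that does match across lines. It is a sketch using Python's re module, not TextPad's engine; the function name is hypothetical, and it assumes the block starts with a "From:" line and ends with the "X-OriginalArrivalTime:" line, as in the example above.

```python
import re

# re.DOTALL lets "." cross line breaks; re.MULTILINE lets "^" match at the
# start of any line. ".*?" is non-greedy, so the match stops at the first
# X-OriginalArrivalTime line instead of the last one in the text.
HEADER_BLOCK = re.compile(
    r"^From:.*?^X-OriginalArrivalTime:[^\n]*\n",
    re.DOTALL | re.MULTILINE,
)


def strip_headers(message):
    """Delete everything from the From: line through the
    X-OriginalArrivalTime: line (first such block only)."""
    return HEADER_BLOCK.sub("", message, count=1)


example = (
    "From: <unknownuser@nowhere.com>\n"
    "To: <applications@nowhere.com>\n"
    "X-OriginalArrivalTime: 20 May 2001 19:48:03.0246 (UTC)\n"
    "\n"
    "EnteringTerm = Fall\n"
)
body = strip_headers(example)
```

Like the expression above, this leaves the blank line after the headers in place.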

And on a further note, why wasn't the Regular Expression engine modified to include searching for content on multiple lines? I'm sure there are plenty of models out there that accommodate this functionality (the JavaScript language is a good example of this with its Regular Expression replacement functions).

Thanks in advance for any help anyone could give on this issue, even if it's "There's no way on god's green earth that you can do something like this inside of Textpad." =)