DUPLICATES

Samuel · Post by **Samuel** » Mon Nov 27, 2000 8:38 am

Hello,
I have a dilemma: I have a list of about 15,000 names. I would like to run a macro that will tab over any word that is a duplicate. Is this more difficult than it sounds?
EXAMPLE

JOHNSON
SMITH
SMITH
HENRY
WILLIAMS
GOERGE
TONY
TOM
TOM
....

THANKS FOR ALL THE HELP!

SAM

Jeff Epstein · Post by **Jeff Epstein** » Mon Nov 27, 2000 3:08 pm

I assume you want this to end up like the following:

JOHNSON
SMITH
[...tab...]SMITH
HENRY
WILLIAMS
GOERGE
TONY
TOM
[...tab...]TOM
....

Please clarify if this is not correct. Anyway, the following Replacement Expression would be the solution you need:

Find What: $.*$\n\1
Replace With: \1\n\t\1
Regular Expression: Checked

But the "find what" regular expression is generating an error: "Invalid regular expression". Not sure why. The following regular expressions *are* valid:

$.*$\n
$.*$\1

There is nothing in the help documentation, as far as I can see, that explains why this particular regular expression is bad. Ideas, anyone?
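For comparison outside TextPad, here is a minimal Python sketch of the same idea: match a line that is immediately followed by an identical line, and insert a tab before the repeat. The function name `indent_adjacent_duplicates` is made up for illustration, and this uses Python's `re` module rather than TextPad's regex syntax. Like the expression above, it only catches duplicates that sit on consecutive lines.

```python
import re

# A line, its newline, and a lookahead requiring the next line to be identical.
# The lookahead does not consume the repeat, so runs of three or more
# identical lines are all indented.
ADJACENT_DUPLICATE = re.compile(r"^(.+)\n(?=\1$)", re.MULTILINE)


def indent_adjacent_duplicates(text: str) -> str:
    """Put a tab in front of every line that repeats the line directly above it."""
    return ADJACENT_DUPLICATE.sub(lambda m: m.group(0) + "\t", text)


names = "JOHNSON\nSMITH\nSMITH\nHENRY\nTOM\nTOM\nTOM"
print(indent_adjacent_duplicates(names))
```

The `$` inside the lookahead keeps a name such as TOM from being treated as a duplicate of a longer following name such as TOMMY.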

Roy Beatty · Post by **Roy Beatty** » Tue Nov 28, 2000 12:54 pm

a) Sam: You can remove duplicates by sorting the file and opting for "Delete Duplicate Lines".

b) Jeff: Regex, it seems, is not good at (perhaps not built for) multi-line matching and replacing. Even so, you've found a bug either in the regex engine or the product documentation.
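As a rough illustration of option (a) outside TextPad, the sort-and-delete-duplicates step could look like this in Python. The function names are hypothetical, and the second helper is an extra that simply lists which names occurred more than once, whether or not they were adjacent in the original list.

```python
from collections import Counter


def sort_and_dedupe(names):
    """Return the names in sorted order with repeats removed."""
    return sorted(set(names))


def duplicated_names(names):
    """Return, in sorted order, every name that appears more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


names = ["JOHNSON", "SMITH", "SMITH", "HENRY", "TOM", "TOM"]
print(sort_and_dedupe(names))
print(duplicated_names(names))
```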

Hope this helps,

Roy Beatty

Roy Beatty · Post by **Roy Beatty** » Wed Nov 29, 2000 10:26 am

Ok. Here's another way, what we call a "kludge",

First replace (regex): "\n" with "#" (or something similar not found in your data)
Also replace: "^" with "#"
To find duplicates, find using: "#$.*$#\1"

After you're done, replace "^#" with "", and then replace "#" with "\n".

Of course, you are "streaming" your discrete records, and you'll want to wrap your text -- while you're matching the dupes. This will affect the appearance of your text somewhat ...
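The placeholder trick can be sketched in Python to show what it does to the data. This is an illustration under assumptions, not TextPad syntax: "#" is assumed not to occur in the names, and the function names are invented for the example.

```python
import re

SEP = "#"  # placeholder; assumed not to occur anywhere in the data


def flatten(text: str) -> str:
    """Join all lines into one, with a placeholder before, between and after them."""
    if SEP in text:
        raise ValueError("placeholder already occurs in the data")
    return SEP + text.replace("\n", SEP) + SEP


def restore(flat: str) -> str:
    """Undo flatten(): drop the outer placeholders and bring the newlines back."""
    return flat[1:-1].replace(SEP, "\n")


def adjacent_duplicates(text: str):
    """Return each record that is immediately followed by an identical record."""
    flat = flatten(text)
    # The lookahead leaves the repeat unconsumed, so it can start the next match.
    return [m.group(1) for m in re.finditer(r"#([^#]+)(?=#\1#)", flat)]


names = "JOHNSON\nSMITH\nSMITH\nHENRY\nTOM\nTOM"
print(flatten(names))
print(adjacent_duplicates(names))
```

Once the records are on a single line, the backreference no longer has to cross a line break, which is the whole point of the workaround.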

If this will not do, then you might develop your own tool to make a copy of your original file, convert it to HTML, highlight the duplicate strings using HTML tags, and view the result in a web browser.

Consider developing the tool in Perl. That great book, _Mastering Regular Expressions_ by Jeffrey Friedl, pp. 233-236, discusses using modifiers (/m and /s) that affect how "^", "$", and "." treat "\n". The modifiers appear *after* the regex string and are not interpreted as part of the regex itself.

Remember that TextPad hews to the POSIX regex standards. Those modifiers are *not* POSIX compliant -- they are Perl syntax.
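To see what those two modifiers change, here is a small sketch using Python's `re` module, whose `re.MULTILINE` and `re.DOTALL` flags play the same roles as Perl's /m and /s. It is only an illustration of the behavior, not something TextPad's engine accepts.

```python
import re

text = "SMITH\nSMITH\nHENRY"

# Without flags, "^" and "$" anchor to the whole string, so nothing matches.
whole_string_only = re.findall(r"^\w+$", text)

# With MULTILINE (Perl /m), "^" and "$" also match at every line break.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

# Without DOTALL, "." refuses to match "\n", so this cannot span two lines.
dot_stops_at_newline = re.search(r"SMITH.HENRY", text)

# With DOTALL (Perl /s), "." matches "\n" as well.
dot_crosses_newline = re.search(r"SMITH.HENRY", text, re.DOTALL)

print(whole_string_only)
print(per_line)
print(dot_stops_at_newline)
print(dot_crosses_newline.group(0) == "SMITH\nHENRY")
```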

I hope this helps,

Roy

Jeff Epstein · Post by **Jeff Epstein** » Wed Nov 29, 2000 12:09 pm

-------Roy Beatty:
> Ok. Here's another way, what we call a "kludge",

Actually, it's "klooj" :' )

Roy Beatty · Post by **Roy Beatty** » Wed Nov 29, 2000 1:43 pm

Ah, well, that was how we spelled it in ye olde pre-object oriented epoch.

Roy

Jeff Epstein · Post by **Jeff Epstein** » Mon Dec 04, 2000 7:22 am

Below is the text from an earlier post in this thread. Does anyone know if this truly is a bug or just something missing from the documentation?

---------------------------------------------------------------
---------------------------------------------------------------
...the following Replacement Expression would be the solution you need:

Find What: $.*$\n\1
Replace With: \1\n\t\1
Regular Expression: Checked

But the "find what" regular expression is generating an error: "Invalid regular expression". Not sure why. The following regular expressions *are* valid:

$.*$\n
$.*$\1

There is nothing in the help documentation, as far as I can see, that explains why this particular regular expression is bad. Ideas, anyone?
---------------------------------------------------------------
---------------------------------------------------------------

Jeff Epstein · Post by **Jeff Epstein** » Thu Dec 28, 2000 6:33 am

It appears there is another thread related to this.

http://www.textpad.com/forum/read.php?f=1&i=195&t=193

As Roy Beatty noted, Regular Expressions are not good for multi-line finding/replacing. The documentation is not clear on this.

Hargobind Khalsa · Post by **Hargobind Khalsa** » Fri Jun 01, 2001 5:18 pm

On a somewhat related topic, I'm having problems figuring out any way to search and replace characters over multiple lines. The following is my example:

From: <unknownuser@nowhere.com>
To: <applications@nowhere.com>
Date: Sun, 20 May 2001 15:48:01 -0400
Received: from mail by SMTP-Server;
Sun, 20 May 2001 15:48:03 -0400
Subject: E-mail Undergraduate Application
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
X-OriginalArrivalTime: 20 May 2001 19:48:03.0246 (UTC)

EnteringTerm = Fall
EnteringTermFallYear = 01
EnteringTermSpringYear =
...

This is an email application that has been sent to me, and what I want to do is remove EVERYTHING above the first line of data (EnteringTerm...). So, I hoped there might be some easy way to just write:
Regular Expression: From[[:cntrl:][:print:]]*X-OriginalArrivalTime.*\n
Replace: <nothing... I want to delete everything that was found>
I *assumed* that [:cntrl:] would include new line characters, but apparently it doesn't. And also, you can't put "\n" inside of bracket ranges ( i.e. [\na-zA-Z0-9]* ).
So, am I just SOL? Or is there actually a way to do this?

And on a further note, why wasn't the Regular Expression engine modified to include searching for content on multiple lines? I'm sure there are plenty of models out there that accommodate this functionality (the JavaScript language is a good example of this with its Regular Expression replacement functions).

Thanks in advance for any help anyone could give on this issue, even if it's "There's no way on god's green earth that you can do something like this inside of TextPad." =)
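For what it is worth, the header-stripping task described above is straightforward in an engine that lets "." cross line breaks. The following is a hedged sketch in Python, not a TextPad solution: the function name `strip_headers` is invented, and it assumes the header block starts with "From:" at the very top and ends with the X-OriginalArrivalTime line, as in the sample message.

```python
import re

# DOTALL lets ".*?" run across line breaks; MULTILINE lets "^" match at the
# start of the X-OriginalArrivalTime line. The trailing "\n+" also swallows
# the blank line(s) between the headers and the data.
HEADER_BLOCK = re.compile(
    r"\AFrom:.*?^X-OriginalArrivalTime:[^\n]*\n+",
    re.DOTALL | re.MULTILINE,
)


def strip_headers(message: str) -> str:
    """Remove everything from the leading "From:" through the last header line."""
    return HEADER_BLOCK.sub("", message, count=1)


sample = (
    "From: <unknownuser@nowhere.com>\n"
    "To: <applications@nowhere.com>\n"
    "X-OriginalArrivalTime: 20 May 2001 19:48:03.0246 (UTC)\n"
    "\n"
    "EnteringTerm = Fall\n"
    "EnteringTermFallYear = 01\n"
)
print(strip_headers(sample))
```

If the message does not begin with a "From:" header, the text is returned unchanged.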
