Hello,
I have a dilemma: I have a list of about 15,000 names. I would like to run a macro that will tab over any word that is a duplicate. Is this more difficult than it sounds?
EXAMPLE
JOHNSON
SMITH
SMITH
HENRY
WILLIAMS
GOERGE
TONY
TOM
TOM
....
THANKS FOR ALL THE HELP!
SAM
DUPLICATES
RE: DUPLICATES
I assume you want this to end up like the following:
JOHNSON
SMITH
[...tab...]SMITH
HENRY
WILLIAMS
GOERGE
TONY
TOM
[...tab...]TOM
....
Please clarify if this is not correct. Anyway, the following Replacement Expression would be the solution you need:
Find What: \(.*\)\n\1
Replace With: \1\n\t\1
Regular Expression: Checked
But the "find what" regular expression is generating an error: "Invalid regular expression". Not sure why. The following regular expressions *are* valid:
\(.*\)\n
\(.*\)\1
There is nothing in the help documentation, as far as I can see, that explains why this particular regular expression is bad. Ideas, anyone?
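For anyone who would rather do this outside TextPad, here is a minimal Python sketch of the same idea. It assumes the duplicates are always on adjacent lines, as in Sam's example. The function names are made up for illustration. The first function is a plain loop; the second is a rough Python equivalent of the Find What / Replace With pair above, not TextPad syntax.

```python
import re


def indent_adjacent_duplicates(text: str) -> str:
    """Prefix a tab to every line that exactly repeats the line before it."""
    out = []
    previous = None
    for line in text.split("\n"):
        # Blank lines are never treated as duplicates (an assumption).
        if line and line == previous:
            out.append("\t" + line)
        else:
            out.append(line)
        previous = line
    return "\n".join(out)


def indent_pairs_with_regex(text: str) -> str:
    """Same idea with a back-reference; re.MULTILINE lets ^ and $ match per line."""
    return re.sub(r"^(.+)\n\1$", r"\1\n\t\1", text, flags=re.MULTILINE)


sample = "JOHNSON\nSMITH\nSMITH\nHENRY\nTOM\nTOM"
print(indent_adjacent_duplicates(sample))
```

One difference worth knowing: the regex version consumes lines in pairs, so in a run of three identical names it only tabs the second one, while the loop tabs the second and the third.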
RE: DUPLICATES
a) Sam: You can remove duplicates by sorting the file and opting for "Delete Duplicate Lines".
b) Jeff: Regex, it seems, is not good at (perhaps not built for) multi-line matching and replacing. Even so, you've found a bug either in the regex engine or the product documentation.
Hope this helps,
Roy Beatty
RE: DUPLICATES
Ok. Here's another way, what we call a "kludge",
First replace (regex): "\n" with "#" (or something similar not found in your data)
Also replace: "^" with "#"
To find duplicates, find using: "#\(.*\)#\1"
After you're done, replace "^#" with "", and then replace "#" with "\n".
Of course, you are "streaming" your discrete records, and you'll want to wrap your text -- while you're matching the dupes. This will affect the appearance of your text somewhat ...
If this will not do, then you might develop your own tool to make a copy of your original file, convert it to HTML, highlight the duplicate strings using HTML tags, and view the result in a web browser.
Consider developing the tool in Perl. Jeffrey Friedl's great book, _Mastering Regular Expressions_ (pp. 233-236), discusses the modifiers /m and /s, which affect how "^", "$", and "." treat "\n". The modifiers appear *after* the regex string and are not interpreted as part of the regex.
Remember that TextPad hews to the POSIX regex standards. Those modifiers are *not* POSIX compliant -- they are Perl syntax.
I hope this helps,
Roy
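The kludge above can be sketched outside TextPad as well. This is a rough Python illustration, not Roy's exact expression: it adds a trailing "#" and uses [^#]+ with a lookahead so that a name such as TOM is not reported as a duplicate of TOMMY. It assumes "#" never appears in the data, and the function name is hypothetical.

```python
import re

DELIM = "#"  # assumed not to occur anywhere in the data


def find_adjacent_duplicates(text: str) -> list[str]:
    """Join the records into one delimited string, then find repeated neighbours."""
    # "Stream" the records: one long line with a delimiter around every record.
    joined = DELIM + text.replace("\n", DELIM) + DELIM
    # A record followed immediately by an identical, fully delimited record.
    pattern = re.compile(r"#([^#]+)(?=#\1#)")
    return [match.group(1) for match in pattern.finditer(joined)]


sample = "JOHNSON\nSMITH\nSMITH\nHENRY\nTOM\nTOM"
print(find_adjacent_duplicates(sample))
```

Because the match only reads the joined copy, the original file is left untouched and there is nothing to convert back afterwards.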
RE: DUPLICATES
-------Roy Beatty:
> Ok. Here's another way, what we call a "kludge",
Actually, it's "klooj" :' )
RE: DUPLICATES
Ah, well, that was how we spelled it in ye olde pre-object oriented epoch.
Roy
RE: DUPLICATES
Below is the text from an earlier post in this thread. Does anyone know if this truly is a bug or just something missing from the documentation?
---------------------------------------------------------------
...the following Replacement Expression would be the solution you need:
Find What: \(.*\)\n\1
Replace With: \1\n\t\1
Regular Expression: Checked
But the "find what" regular expression is generating an error: "Invalid regular expression". Not sure why. The following regular expressions *are* valid:
\(.*\)\n
\(.*\)\1
There is nothing in the help documentation, as I can see, that gives any explanation to why this particular regular expression is bad. Ideas, anyone?
---------------------------------------------------------------
RE: DUPLICATES
It appears there is another thread related to this.
http://www.textpad.com/forum/read.php?f=1&i=195&t=193
As Roy Beatty noted, TextPad's regular expressions are not good at multi-line finding and replacing. The documentation is not clear on this.
RE: DUPLICATES
On a somewhat related topic, I'm having problems figuring out any way to search and replace characters over multiple lines. The following is my example:
From: <unknownuser@nowhere.com>
To: <applications@nowhere.com>
Date: Sun, 20 May 2001 15:48:01 -0400
Received: from mail by SMTP-Server;
Sun, 20 May 2001 15:48:03 -0400
Subject: E-mail Undergraduate Application
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
X-OriginalArrivalTime: 20 May 2001 19:48:03.0246 (UTC)
EnteringTerm = Fall
EnteringTermFallYear = 01
EnteringTermSpringYear =
...
This is an email application that has been sent to me, and what I want to do is remove EVERYTHING above the first line of data (EnteringTerm...). So, I hoped there might be some easy way to just write:
Regular Expression: From[[:cntrl:][:print:]]*X-OriginalArrivalTime.*\n
Replace: <nothing... I want to delete everything that was found>
I *assumed* that [:cntrl:] would include newline characters, but apparently it doesn't. Also, you can't put "\n" inside a bracket expression (e.g. [\na-zA-Z0-9]*).
So, am I just SOL? Or is there actually a way to do this?
And on a further note, why wasn't the Regular Expression engine modified to include searching for content on multiple lines? I'm sure there are plenty of models out there that accommodate this functionality (the JavaScript language is a good example of this with its Regular Expression replacement functions).
Thanks in advance for any help anyone could give on this issue, even if it's "There's no way on god's green earth that you can do something like this inside of TextPad." =)
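For comparison, here is a minimal sketch of the same deletion in Python, where multi-line matching is available. It says nothing about what TextPad itself can do. The function name is hypothetical, and it assumes the header starts at the very beginning of the text and ends with the X-OriginalArrivalTime line. re.DOTALL and re.MULTILINE are Python's counterparts of the Perl /s and /m modifiers mentioned earlier in the thread.

```python
import re

# \A anchors at the start of the text; with re.DOTALL, "." also matches "\n",
# so ".*?" can run across lines; with re.MULTILINE, "^" matches at each line start.
HEADER = re.compile(
    r"\AFrom:.*?^X-OriginalArrivalTime:[^\n]*\n",
    re.DOTALL | re.MULTILINE,
)


def strip_mail_header(message: str) -> str:
    """Delete everything up to and including the X-OriginalArrivalTime line."""
    return HEADER.sub("", message, count=1)


message = (
    "From: <unknownuser@nowhere.com>\n"
    "To: <applications@nowhere.com>\n"
    "X-OriginalArrivalTime: 20 May 2001 19:48:03.0246 (UTC)\n"
    "EnteringTerm = Fall\n"
    "EnteringTermFallYear = 01\n"
)
print(strip_mail_header(message))
```

If the text has no such header, the pattern does not match and the message comes back unchanged.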