Deleting Duplicate Lines Using Tagged Expressions

Posted: Tue Jul 03, 2012 4:23 pm
by Adrian
Hello,

I would like to remove duplicate lines from a text file. Thus, if my input file looks like this:

line1
row2
row2
row3
line4

I would like to generate output like this:

line1
row2
row3
line4

Using Google, I found a short regular expression. I enabled POSIX regular expression syntax and tried to replace

 ^(.*)\n\1 
by

 \1 

Unfortunately, I received the error "Invalid regular expression". I tried a very similar expression in a different editor and it worked. What am I doing wrong?

Thank you very much in advance,

Adrian
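
For comparison, here is a minimal Python sketch of the replacement attempted above. Python's `re` module accepts a back-reference that follows a `\n`; this is not TextPad syntax. The function name `collapse_adjacent_duplicates` is made up for illustration. The `(?:...)+` group and the trailing `$` are additions to the posted expression, so that a run of three or more identical lines collapses in one pass and `row2` followed by `row22` is left alone.

```python
import re

# ^ and $ match at every line start/end because of re.MULTILINE.
# Group 1 captures one whole line; (?:\n\1)+ matches one or more
# immediately following copies of that same line.
ADJACENT_DUPLICATES = re.compile(r"^(.*)(?:\n\1)+$", re.MULTILINE)


def collapse_adjacent_duplicates(text: str) -> str:
    """Keep one copy of each run of identical consecutive lines."""
    return ADJACENT_DUPLICATES.sub(r"\1", text)


sample = "line1\nrow2\nrow2\nrow3\nline4\n"
print(collapse_adjacent_duplicates(sample), end="")
```

Like the original expression, this only catches duplicates that sit directly under each other, and it has not been checked against edge cases such as runs of blank lines.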

Posted: Tue Jul 03, 2012 5:07 pm
by ak47wong
TextPad's regular expression engine doesn't allow the use of backreferences in the search string. Your options are:
  1. Use the other text editor you tried.
  2. Try this tool.
  3. Use the Sort function in TextPad (Tools > Sort) and select Delete duplicate lines, provided sorting the file at the same time is acceptable.
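
Outside TextPad, the effect of option 3 can be approximated with a few lines of Python. This is only a sketch: the function name `sort_and_dedupe` is hypothetical, and Python's default string ordering is assumed here, which may differ from the order TextPad's Sort produces.

```python
def sort_and_dedupe(text: str) -> str:
    """Sort the lines and drop duplicates; the original order is lost."""
    # set() removes duplicates, sorted() puts the survivors in order.
    unique_lines = sorted(set(text.splitlines()))
    return "\n".join(unique_lines) + "\n"


print(sort_and_dedupe("line1\nrow2\nrow2\nrow3\nline4\n"), end="")
```

Unlike the regular expression approach, this also removes duplicates that are not adjacent, at the price of reordering the file.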

Posted: Tue Jul 03, 2012 9:34 pm
by ben_josephs
In fact, TextPad does allow back-references in a search string; it just doesn't allow them to refer back over a newline.

For example,
\<([^ ]+) \1\>
matches repeated words within a line.
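
To try the same idea outside TextPad, here is a small Python sketch. Python's `re` module has no `\<` and `\>`, so `\b` (word boundary) stands in for both, which is an approximation rather than an exact translation. The function name `repeated_words` is made up for illustration.

```python
import re

# \b replaces the POSIX-style \< and \> word anchors.
# Group 1 is a run of non-space characters; \1 requires the same
# run again after a single space.
REPEATED_WORD = re.compile(r"\b([^ ]+) \1\b")


def repeated_words(line: str) -> list[str]:
    """Return each word that is immediately repeated within the line."""
    return [match.group(1) for match in REPEATED_WORD.finditer(line)]


print(repeated_words("this is is a test"))
```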

Posted: Wed Jul 04, 2012 4:09 pm
by Adrian
Thank you very much for the fast response. Now that I know why my expression failed, I was able to solve the problem with a three-step approach:

1. Replacing all newlines with a unique string.

Search:  \n
Replace: XXX

2. Replacing duplicate "lines"

Search:  XXX(.*)XXX\1
Replace: XXX\1

3. Replacing the unique strings with newlines again.

Search:  XXX
Replace: \n

It is a little bit ugly, but works for me and can maybe help someone else ;-)

Kind regards,

Adrian
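
The three steps above can be mirrored in Python as a rough sketch. The function name `dedupe_three_steps` and the check that the placeholder is unused are assumptions added for illustration, and repeating step 2 until nothing changes is also an addition, because a single pass can leave part of a longer run of identical lines behind.

```python
import re

PLACEHOLDER = "XXX"  # assumed not to occur anywhere in the input


def dedupe_three_steps(text: str) -> str:
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the text")

    # Step 1: replace every newline with the placeholder.
    joined = text.replace("\n", PLACEHOLDER)

    # Step 2: search XXX(.*)XXX\1 and replace it with XXX\1.
    marker = re.escape(PLACEHOLDER)
    pattern = re.compile(marker + r"(.*)" + marker + r"\1")
    while True:
        # Looping is an addition to the post (see the note above).
        joined, count = pattern.subn(
            lambda m: PLACEHOLDER + m.group(1), joined
        )
        if count == 0:
            break

    # Step 3: turn the placeholders back into newlines.
    return joined.replace(PLACEHOLDER, "\n")


print(dedupe_three_steps("line1\nrow2\nrow2\nrow3\nline4\n"), end="")
```

Two caveats carry over from the posted expressions: the first line has no placeholder in front of it, so a duplicate of the very first line is not caught, and an empty line is removed as well, because an empty capture trivially repeats itself.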