Hi. I need to find and eliminate duplicate lines in a file with more than 5000 lines. I know this can be partly achieved with the "sort" + "eliminate duplicate lines" command. However, what I need to do is slightly different.
Wherever there is a duplicate line, I need to eliminate both the duplicate AND the original line. Since the file is very large and the strings are quite long, doing it by hand is rather tedious.
I've tried to understand the regular expression (RE) syntax for "find and replace" but could not come up with a solution. Can anybody help?
Henrique Serra
serra@cpd.ufmt.br
Finding repeated lines
-
Ed Orchard
Re: Finding repeated lines
Are there only ever pairs of identical lines? If so then there is a (clunky) solution.
If there could be more than 2 identical lines then it won't work.
1. Select a character that is unused in the file (¬ in this example)
2. Sort without deleting duplicates
3. Combine pairs of lines by:
Search for ^\(.*\)\n\(.*\)$
Replace all with \1¬\2
4. Mark repeating pairs by:
Search for ^\([^¬]*\)¬\1$
Mark All
5. Delete marked lines
Edit/Cut other/Bookmarked lines
6. Remove ¬
Search ¬
Replace all with \n
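7. Add a dummy line at start of file (this shifts the pairing by one line, so identical lines that were split across two pairs in the first pass now land in the same pair)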
8. Repeat steps 3 to 6
9. Remove dummy line
Voilà.
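For anyone who would rather do this outside the editor, here is a minimal sketch in Python. It is not the search-and-replace procedure above but a different technique (counting occurrences), and it also copes with lines that appear more than twice. The function name `drop_repeated_lines` and the sample data are made up for illustration.

```python
from collections import Counter


def drop_repeated_lines(lines):
    """Keep only the lines that occur exactly once, in their original order.

    Hypothetical helper for illustration: every line that has at least one
    duplicate is removed entirely (the original and all of its copies).
    """
    counts = Counter(lines)  # how many times each distinct line occurs
    return [line for line in lines if counts[line] == 1]


if __name__ == "__main__":
    sample = ["alpha", "beta", "alpha", "gamma", "beta", "beta", "delta"]
    print(drop_repeated_lines(sample))  # ['gamma', 'delta']
```

To use it on a real file, read the text, split it with `splitlines()`, pass the resulting list to the function, and write the result back joined with newlines. Unlike the sort-based steps, this keeps the surviving lines in their original order.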
-
Henrique Serra
Re: Finding repeated lines
Yes, Ed, your answer completely addressed the issue. Your clever solution works great. Thank you so very much for your help.
Henrique Serra