Sorting, erasing multiple lines

Posted: Mon Sep 03, 2007 9:40 am
by bluebang
Manipulating big files in particular may call for a sorting option
'delete all lines with a repeated occurrence of the same sort key'
as an extension of
'delete duplicate lines'.

Example:

Peter   England
Julia    France
Hans     Germany
Gerlinde Germany
Margit   Norway
Nils     Norway
Victor   Poland
Marek    Poland
John     USA

Task: From this table a list of the 'touched' countries is to be derived.
Activating this option while sorting on columns 9 ... will result in:

Peter   England
Julia    France
Hans     Germany
Margit   Norway
Victor   Poland
John     USA
Searching this forum, I found some questions and suggestions concerning sorting and extraction that would become obsolete with this extension.

Posted: Mon Sep 03, 2007 12:26 pm
by MudGuard
But then the question arises: which of the lines with the doubled (tripled/quadrupled/...) sort keys is to be kept?


If you only want to find a list of (unique) country names:

Select the column with the country names (block select), copy it into a new file, and sort it with deletion of duplicate lines.
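For readers who want to try the same idea outside the editor, here is a minimal Python sketch of "extract the column, sort, drop duplicates". It assumes the country is the last whitespace-separated field on each line; the function name `unique_countries` is made up for this illustration.

```python
def unique_countries(lines):
    """Return the sorted, de-duplicated last field of each non-blank line."""
    # Assumption: the country is the last whitespace-separated field.
    countries = {line.split()[-1] for line in lines if line.strip()}
    return sorted(countries)


rows = [
    "Peter    England",
    "Julia    France",
    "Hans     Germany",
    "Gerlinde Germany",
    "Margit   Norway",
    "Nils     Norway",
    "Victor   Poland",
    "Marek    Poland",
    "John     USA",
]
print("\n".join(unique_countries(rows)))
```

Note that this only yields the country names; the rest of each line is discarded.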

Posted: Mon Sep 03, 2007 3:23 pm
by bluebang
The algorithm would keep the first entry, whose position may be the result of a previous sort. This list is only meant as an example. There are many other redundancies that might be valuable and should remain in the file.
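The "keep the first entry per sort key" rule could be sketched in Python roughly as follows. This is not a TextPad feature, only an illustration; it assumes the sort key is the last whitespace-separated field rather than a fixed column range, and `keep_first_per_key` is a hypothetical name.

```python
def keep_first_per_key(lines, key=lambda line: line.split()[-1]):
    """Keep whole lines, but only the first line seen for each key."""
    seen = set()
    kept = []
    for line in lines:
        if not line.strip():
            kept.append(line)  # pass blank lines through unchanged
            continue
        k = key(line)  # assumption: key is the last field on the line
        if k not in seen:
            seen.add(k)
            kept.append(line)
    return kept


rows = [
    "Peter    England",
    "Julia    France",
    "Hans     Germany",
    "Gerlinde Germany",
    "Margit   Norway",
    "Nils     Norway",
    "Victor   Poland",
    "Marek    Poland",
    "John     USA",
]
print("\n".join(keep_first_per_key(rows)))
```

Because the input order is preserved, whichever line a previous sort placed first is the one that survives, and the other columns of that line stay in the file.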

Posted: Tue Sep 04, 2007 7:30 am
by nvj1662
I suspect all this can be achieved with a regular expression. I'm sure if you post your requirement in that forum, one of the regex gurus will give you the answer.

Posted: Tue Sep 04, 2007 9:14 am
by ben_josephs
TextPad's regular expression recogniser doesn't allow back-references (such as \1) that refer back over a newline. So, within a TextPad regular expression, you can't refer back to text on a previous line that matched a subexpression of the same regular expression.

But you can do it with WildEdit (http://www.textpad.com/products/wildedit/), which uses a far more powerful regex recogniser. Repeatedly run this replacement:
Find what: ([^ ]+ +)(.+)\r?\n[^ ]+ +\2
Replace with: $1$2

[X] Regular expression
[X] Replacement format

Options
[X] '.' does not match a newline character

Posted: Wed Sep 05, 2007 8:18 am
by ben_josephs
Or, in WildEdit, run this just once:
Find what: ([^ ]+ +)(.+)(\r?\n[^ ]+ +\2)+
Replace with: $1$2

[X] Regular expression
[X] Replacement format

Options
[X] '.' does not match a newline character
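
For anyone without WildEdit, a similar single-pass replacement can be tried with Python's re module. This is an adaptation, not the WildEdit expression itself: it anchors the match to whole lines with ^ and $ (re.MULTILINE), uses \S instead of [^ ], and writes the replacement as \1\2 instead of $1$2. It assumes the file is already sorted so that equal keys are on adjacent lines, and that lines end in a plain newline.

```python
import re

# Adapted from the WildEdit expression above; in Python '.' does not
# match a newline unless re.DOTALL is given, which mirrors the option.
PATTERN = re.compile(r"^(\S+ +)(.+)(?:\r?\n\S+ +\2$)+", re.MULTILINE)


def collapse_adjacent_keys(text):
    """Replace each run of lines sharing the second field by its first line."""
    return PATTERN.sub(r"\1\2", text)


sample = "\n".join([
    "Peter    England",
    "Julia    France",
    "Hans     Germany",
    "Gerlinde Germany",
    "Margit   Norway",
    "Nils     Norway",
    "Victor   Poland",
    "Marek    Poland",
    "John     USA",
])
print(collapse_adjacent_keys(sample))
```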