Not sure if this is possible, but...

Posted: Mon Apr 23, 2012 1:41 pm
by daveytyke
I have lots of text files I need to bash into shape, in the format -

"Aberdeen City",2810,1663,196,82,215,547,107
"Aberdeen City",2806,1663,196,81,215,546,105
"Aberdeen City",250,46,20,21,89,45,29

etc etc

I need to keep the first "Aberdeen City" but remove all other occurrences, to end with something like -


"Aberdeen City",2810,1663,196,82,215,547,107,2806,1663,196,81,215,546,105,250,46,20,21,89,45,29

Any ideas?

Posted: Mon Apr 23, 2012 9:43 pm
by ben_josephs
You haven't provided much information.

Are there lines with the names of places other than Aberdeen?

How big is the file?

Posted: Tue Apr 24, 2012 7:37 am
by daveytyke
Sorry, yes there are lines with other place names - these are files of data from the 2001 census in Scotland, so there are entries for all the Council Areas -

Aberdeen City, Angus, Argyll & Bute, Clackmannanshire etc.

This file is 4608 lines long, I have many similar files, each with a slightly different number of lines.

Posted: Tue Apr 24, 2012 8:56 am
by ben_josephs
Since your file is so small, the following should work. It will not work on big files.

First, use "Posix" regular expression syntax:
Configure | Preferences | Editor

[X] Use POSIX regular expression syntax

Choose a character that does not occur in your document, say #.

Then try the following three steps:

1. Make the entire text into a single line by replacing each newline with # (this is what won't work on a big file):
Find what: \n
Replace with: #

[X] Regular expression

Replace All

2. Remove the repeated place names:
Find what: (^|#)([^,]+)(,[^#]+)#\2
Replace with: \1\2\3

[X] Regular expression

Replace All -- do this repeatedly until it beeps (twice, if all the records are similar to the ones in your sample)

3. Change each # back to a newline:
Find what: #
Replace with: \n

[X] Regular expression

Replace All
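
If you want to check the logic outside TextPad, the three steps above can be sketched in Python. This is only an illustration under assumptions: Python's re module stands in for TextPad's POSIX engine, merge_by_joining is a made-up name, and # is assumed not to occur in the data.

```python
import re

# The same expression as in step 2, with # as the stand-in for newline.
PATTERN = re.compile(r"(^|#)([^,]+)(,[^#]+)#\2")

def merge_by_joining(text):
    """Emulate the three Replace All steps on text held in memory."""
    if "#" in text:
        raise ValueError("choose a different separator: # occurs in the text")
    # Step 1: make the whole text a single line.
    joined = text.strip("\n").replace("\n", "#")
    # Step 2: remove a repeated place name; repeat until nothing changes
    # (the equivalent of pressing Replace All until it beeps).
    count = 1
    while count:
        joined, count = PATTERN.subn(r"\1\2\3", joined)
    # Step 3: change each # back to a newline.
    return joined.replace("#", "\n")
```

Each pass merges a record with the one after it, which is why more than one pass is needed when a place name occurs on three or more consecutive lines.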

Posted: Tue Apr 24, 2012 9:04 am
by daveytyke
Many thanks, that worked really well.

So, how did it work and why will it not work for large files and what is a large file in this case?

I have some similar files that I will need to work on later that are upwards of 420,000 lines long. However, they use fixed length zone codes rather than variable length place names.

Regards,

Davey

Posted: Tue Apr 24, 2012 10:28 am
by ben_josephs
(^|#)([^,]+)(,[^#]+)#\2 matches

(           start of captured text number 1
  ^|#       either the beginning of a line or a hash (see below)
)           end of captured text number 1
(           start of captured text number 2
  [^,]+     any non-empty string within a line not containing a comma (see below)
)           end of captured text number 2
(           start of captured text number 3
  ,         a comma
  [^#]+     any non-empty string within a line not containing a hash (see below)
)           end of captured text number 3
#           a hash
\2          a repeat of captured text number 2

where ^|# matches:

^           the beginning of a line
|           or
#           a hash

and [^,]+ matches:

[^,]        any character except newline or comma
+           ... any (non-zero) number of times

and [^#]+ matches:

[^#]        any character except newline or hash
+           ... any (non-zero) number of times
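
To see the captures concretely, here is a small check of the same expression. It uses Python's re module as a stand-in, which is an assumption: it is not TextPad's engine, and in Python [^,] would also match a newline, though the joined text contains none. The shortened sample record values are made up for the demonstration.

```python
import re

pattern = re.compile(r"(^|#)([^,]+)(,[^#]+)#\2")

# Three records already joined into one line with # (made-up short data).
joined = '"Aberdeen City",2810,1663#"Aberdeen City",2806,1663#"Angus",250,46'

match = pattern.search(joined)
groups = match.groups()
# groups[0] is the empty string: the match starts at the beginning of the line
# groups[1] is the place name, quotes included
# groups[2] is the rest of the first record, starting at its comma

# Keeping \1\2\3 drops the # and the repeated name that followed them.
merged = pattern.sub(r"\1\2\3", joined)
```

The "Angus" record is left alone because no line with the same name follows it.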
The problem with this solution with big files is that, although TextPad's line length limit is high (I don't know what it is), the program gets very slow when handling very long lines. Try it.

You might be able to avoid this problem by applying the solution to sections of your file, one at a time. Don't forget to save after each successful replacement.

If the lines didn't have to be joined this problem wouldn't arise. But they have to be joined because of a deficiency of TextPad's regex recogniser: its handling of newlines is very weak. In particular, it doesn't allow a back-reference to refer back over a newline.
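
For comparison, here is a rough sketch of what the replacement could look like in an engine where a back-reference can refer back over a newline. Python's re module is used purely as an example of such an engine; this is not TextPad or WildEdit syntax, merge_without_joining is a made-up name, and the (?=,) lookahead is an extra safeguard that is not part of the expression above.

```python
import re

# ^ matches at the start of every line because of re.MULTILINE.
# \1 after the \n is a back-reference that reaches back over the newline.
PATTERN = re.compile(r"^([^,\n]+)(,[^\n]*)\n\1(?=,)", re.MULTILINE)

def merge_without_joining(text):
    """Merge consecutive lines that start with the same name, in place."""
    count = 1
    while count:
        # Each pass merges a line with the line after it; repeat until
        # no further replacement is made.
        text, count = PATTERN.subn(r"\1\2", text)
    return text
```

No placeholder character is needed here, and the text never becomes one enormous line.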

The solution would be simpler using a tool with a more powerful regex recogniser, such as Helios's own WildEdit (http://www.textpad.com/products/wildedit/). Or you could do it with a script (but not in TextPad, as it doesn't support scripts).
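
As an idea of what such a script might look like, here is a minimal sketch in Python (my choice of language, not something TextPad provides). It assumes that lines for the same place are adjacent, that the place name itself contains no comma, and that blank lines can be dropped; merge_runs is a made-up name.

```python
def merge_runs(lines):
    """Yield one merged line per run of consecutive lines sharing a first field."""
    current_key = None
    parts = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines carry no data
        # Split at the first comma: the name, then the rest of the record.
        key, _, rest = line.partition(",")
        if key != current_key:
            if parts:
                yield ",".join(parts)  # finish the previous place
            current_key = key
            parts = [key]
        if rest:
            parts.append(rest)
    if parts:
        yield ",".join(parts)  # finish the last place
```

It can be given any iterable of lines, including an open file, and it only holds one place's worth of data at a time, so the length of the file should matter far less than it does with the joined-line approach.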

TextPad's help has some notes on regular expressions, but they are rather brief. Look under
Reference Information | Regular Expressions,
Reference Information | Replacement Expressions and
How to... | Find and Replace Text | Use Regular Expressions.

There are many regular expression tutorials on the web, and you will find recommendations for some of them if you search this forum.

A standard reference for regular expressions is

Friedl, Jeffrey E F
Mastering Regular Expressions, 3rd ed
O'Reilly, 2006
ISBN: 0-596-52812-4
http://regex.info/

But be aware that the regular expression recogniser used by TextPad is very weak compared with modern tools. So you may get frustrated if you discover a handy trick that works elsewhere but doesn't work in TextPad.

Posted: Tue Apr 24, 2012 10:52 am
by daveytyke
Thanks, I shall give it a go.