I have lots of text files I need to bash into shape, in the format -
"Aberdeen City",2810,1663,196,82,215,547,107
"Aberdeen City",2806,1663,196,81,215,546,105
"Aberdeen City",250,46,20,21,89,45,29
etc etc
I need to keep the first "Aberdeen City" but remove all other occurrences, to end with something like -
"Aberdeen City",2810,1663,196,82,215,547,107,2806,1663,196,81,215,546,105,250,46,20,21,89,45,29
Any ideas?
Not sure if this is possible, but...
-
ben_josephs
- Posts: 2464
- Joined: Sun Mar 02, 2003 9:22 pm
Sorry, yes there are lines with other place names - these are files of data from the 2001 census in Scotland, so there are entries for all the Council Areas -
Aberdeen City, Angus, Argyll & Bute, Clackmannanshire etc.
This file is 4608 lines long, I have many similar files, each with a slightly different number of lines.
-
ben_josephs
Since your file is so small, the following should work. It will not work on big files.
First, use "Posix" regular expression syntax:
    Configure | Preferences | Editor
    [X] Use POSIX regular expression syntax
Choose a character that does not occur in your document, say #.
Then try the following three steps:
1. Make the entire text into a single line by replacing each newline with # (this is what won't work on a big file):
    Find what: \n
    Replace with: #
    [X] Regular expression
    Replace All
2. Remove the repeated place names:
    Find what: (^|#)([^,]+)(,[^#]+)#\2
    Replace with: \1\2\3
    [X] Regular expression
    Replace All -- do this repeatedly until it beeps (twice, if all the records are similar to the ones in your sample)
3. Change each # back to a newline:
    Find what: #
    Replace with: \n
    [X] Regular expression
    Replace All
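For anyone who wants to see why step 2 has to be repeated, here is a rough Python sketch of the same three steps. It is only an illustration: the function name merge_repeats is made up, Python's re module stands in for TextPad's recogniser, and it assumes # never occurs in the data and that no place name contains a comma.

```python
import re

# The pattern from step 2, unchanged. Python's re happens to accept it as is.
PATTERN = re.compile(r'(^|#)([^,]+)(,[^#]+)#\2')


def merge_repeats(text):
    """Emulate the three search-and-replace steps on an in-memory string."""
    if "#" in text:
        raise ValueError("choose a different separator: # occurs in the data")
    # Step 1: join every line into one long line, with # in place of newlines.
    joined = "#".join(text.splitlines())
    # Step 2: each pass merges at most one neighbouring pair per run of
    # identical names, so keep going until a pass changes nothing
    # (the point at which TextPad would beep).
    while True:
        joined, count = PATTERN.subn(r'\1\2\3', joined)
        if count == 0:
            break
    # Step 3: turn each remaining # back into a newline.
    return joined.replace("#", "\n")


sample = (
    '"Aberdeen City",2810,1663,196,82,215,547,107\n'
    '"Aberdeen City",2806,1663,196,81,215,546,105\n'
    '"Aberdeen City",250,46,20,21,89,45,29\n'
)
print(merge_repeats(sample))
```

With the three sample lines from the question, the first pass merges lines one and two, the second pass merges in line three, and the third pass finds nothing.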
Many thanks, that worked really well.
So, how did it work and why will it not work for large files and what is a large file in this case?
I have some similar files that I will need to work on later that are upwards of 420,000 lines long. However, they use fixed-length zone codes rather than variable-length place names.
Regards,
Davey
-
ben_josephs
(^|#)([^,]+)(,[^#]+)#\2 matches
    (        start of captured text number 1
    ^|#      either the beginning of a line or a hash (see below)
    )        end of captured text number 1
    (        start of captured text number 2
    [^,]+    any non-empty string within a line not containing a comma (see below)
    )        end of captured text number 2
    (        start of captured text number 3
    ,        a comma
    [^#]+    any non-empty string within a line not containing a hash (see below)
    )        end of captured text number 3
    #        a hash
    \2       captured text number 2
where ^|# matches:
    ^        the beginning of a line
    |        or
    #        a hash
and [^,]+ matches:
    [^,]     any character except newline or comma
    +        ... any non-zero number of times
and [^#]+ matches:
    [^#]     any character except newline or hash
    +        ... any non-zero number of times
The problem with this solution with big files is that, although TextPad's line length limit is high (I don't know what it is), the program gets very slow when handling very long lines. Try it.
You might be able to avoid this problem by applying the solution to sections of your file, one at a time. Don't forget to save after each successful replacement.
If the lines didn't have to be joined this problem wouldn't arise. But they have to be joined because of a deficiency of TextPad's regex recogniser: its handling of newlines is very weak. In particular, it doesn't allow a back-reference to refer back over a newline.
The solution would be simpler using a tool with a more powerful regex recogniser, such as Helios's own WildEdit (http://www.textpad.com/products/wildedit/). Or you could do it with a script (but not in TextPad, as it doesn't support scripts).
TextPad's help has some notes on regular expressions, but they are rather brief. Look under
Reference Information | Regular Expressions,
Reference Information | Replacement Expressions and
How to... | Find and Replace Text | Use Regular Expressions.
There are many regular expression tutorials on the web, and you will find recommendations for some of them if you search this forum.
A standard reference for regular expressions is
Friedl, Jeffrey E F
Mastering Regular Expressions, 3rd ed
O'Reilly, 2006
ISBN: 0-596-52812-4
http://regex.info/
But be aware that the regular expression recogniser used by TextPad is very weak compared with modern tools. So you may get frustrated if you discover a handy trick that works elsewhere but doesn't work in TextPad.
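To see what each captured text in (^|#)([^,]+)(,[^#]+)#\2 actually holds, it can help to run the pattern once against a short joined line. The snippet below is only a sketch using Python's re module rather than TextPad, and the shortened sample values are made up for illustration.

```python
import re

PATTERN = re.compile(r'(^|#)([^,]+)(,[^#]+)#\2')

# Three records already joined with # (values shortened for the example).
joined = '"Aberdeen City",2810,1663#"Aberdeen City",2806,1663#"Angus",1,2'

match = PATTERN.search(joined)
print(repr(match.group(1)))  # ''  (the beginning of the line, so nothing is captured)
print(match.group(2))        # "Aberdeen City"
print(match.group(3))        # ,2810,1663
print(match.group(0))        # "Aberdeen City",2810,1663#"Aberdeen City"

# Replacing the whole match with \1\2\3 drops the trailing #"Aberdeen City",
# which glues the second record's numbers onto the first record.
print(PATTERN.sub(r'\1\2\3', joined, count=1))
# "Aberdeen City",2810,1663,2806,1663#"Angus",1,2
```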
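As for the script route mentioned above, here is one possible sketch in Python (my choice of language, nothing to do with TextPad). It reads the lines one at a time, so it never builds a single huge line and should cope with files of several hundred thousand lines. The function name merge_runs is made up, and the sketch assumes that the key (place name or zone code) is everything before the first comma and that lines with the same key are adjacent, as in the sample.

```python
import io
from itertools import groupby


def merge_runs(lines):
    """Yield one merged line per run of consecutive lines sharing a first field."""
    # Split each non-blank line at its first comma into (key, ",", rest).
    records = (
        line.rstrip("\r\n").partition(",")
        for line in lines
        if line.strip()
    )
    # groupby only groups neighbouring records, which matches the sample data.
    for key, run in groupby(records, key=lambda rec: rec[0]):
        values = ",".join(rest for _, _, rest in run if rest)
        yield f"{key},{values}" if values else key


# A real file object can be passed in place of this in-memory sample.
sample = io.StringIO(
    '"Aberdeen City",2810,1663,196,82,215,547,107\n'
    '"Aberdeen City",2806,1663,196,81,215,546,105\n'
    '"Aberdeen City",250,46,20,21,89,45,29\n'
    '"Angus",1,2,3\n'
)
for merged in merge_runs(sample):
    print(merged)
```

For a real file you would open the input and output files and write each merged line followed by a newline. The "Angus" row is invented just to show that a second place name starts a new output line.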