Extracting data, including some bracketed

terrypin · Post by **terrypin** » Thu Oct 24, 2013 3:49 pm

My source looks like this (a list of long distance UK walks):

1066 Country Walk, East Sussex - 50 km (31 miles) Pevensey Castle to Rye
Abbeys Amble, North Yorkshire, 167 km (104 miles)
Abbott's Hike, 172 km (107 miles) Cumbria challenging moorland walking
Ainsty Bounds Walk, North Yorkshire, circular from Tadcaster, 71 km (44 miles)
Angles Way, 123 km (76 miles) from Great Yarmouth to Knettishall Heath, with much of the path following the Norfolk/Suffolk border. Additionally there is a link path from Knettishall Heath to Thetford
Avon Valley Path, 54 km (34 miles) Christchurch to Salisbury (Hampshire and Wiltshire)
Basingstoke Canal, 53 km (33 miles)
Bishop Bennet Way, 55 km (34 miles) Beeston to Wirswall (Cheshire, Staffordshire)
etc

I want a list like this:

1066 Country Walk, 31
Abbeys Amble, 104
Abbott's Hike, 107
Ainsty Bounds Walk, 44
etc

(So that I can import it into a spreadsheet and sort by distance.)

IOW I want the name before the first comma followed by a comma, a space and the mileage taken from inside the first pair of brackets.

I'm not clear why this doesn't work:

Find: (.*), (.*) \((.*) (.*)
Replace with: \1, \3

--
Terry, East Grinstead, UK

ben_josephs · Post by **ben_josephs** » Thu Oct 24, 2013 5:11 pm

If you're tempted to use .* check whether it's what you really mean.
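
To see what a greedy `.*` actually captures, here is a minimal sketch that runs the original attempt against the first line of the list. It uses Python's `re` module purely as a stand-in, which is an assumption: TextPad's engine differs in syntax details, though its `.*` should be similarly greedy. The variable names are illustrative.

```python
import re

# The original attempt, transcribed into Python syntax (illustrative only).
attempt = re.compile(r"(.*), (.*) \((.*) (.*)")

line = "1066 Country Walk, East Sussex - 50 km (31 miles) Pevensey Castle to Rye"

match = attempt.match(line)
groups = match.groups()
for number, text in enumerate(groups, start=1):
    print(f"group {number}: {text!r}")

# Each .* grabs as much as it can, so group 3 runs on past the
# closing bracket instead of stopping after the mileage.
result = attempt.sub(r"\1, \3", line)
print(result)
```

Group 3 ends up holding `31 miles) Pevensey Castle to` rather than just `31`, which is why the replacement does not give the wanted list.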

Try

Find what: ^([^,]+),[^(]*\((\d+).*
Replace with: $1, $2

Edit: Changed \ to new-style $ in replacement expression.
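
For anyone who wants to check the expression outside TextPad, the following is a rough sketch in Python, assuming its `re` module is close enough for this pattern. Python writes back-references in the replacement as `\1` rather than `$1`, and the sort step at the end is an illustrative extra standing in for the spreadsheet.

```python
import re

# Name up to the first comma, then the first run of digits after the first "(".
pattern = re.compile(r"^([^,]+),[^(]*\((\d+).*")

walks = [
    "1066 Country Walk, East Sussex - 50 km (31 miles) Pevensey Castle to Rye",
    "Abbeys Amble, North Yorkshire, 167 km (104 miles)",
    "Abbott's Hike, 172 km (107 miles) Cumbria challenging moorland walking",
    "Ainsty Bounds Walk, North Yorkshire, circular from Tadcaster, 71 km (44 miles)",
]

# Apply the replacement one line at a time.
converted = [pattern.sub(r"\1, \2", walk) for walk in walks]

# Sort by the mileage, which is the number after the last ", ".
by_distance = sorted(converted, key=lambda row: int(row.rsplit(", ", 1)[1]))

for row in by_distance:
    print(row)
```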

terrypin · Post by **terrypin** » Thu Oct 24, 2013 7:56 pm

Thanks, but here that's telling me it cannot find the RE

^([^,]+),[^(]*\((\d+).*

I agree, I'm expecting too much from .* and will have to study the rules carefully again!

--
Terry, East Grinstead, UK

ben_josephs · Post by **ben_josephs** » Thu Oct 24, 2013 9:17 pm

Are you still using version 4.7.2? Try

Find what: ^([^,]+),[^(]*\(([0-9]+).*
Replace with: \1, \2

Version 4.7.2 is over 10 years old. It may be advantageous to you (and to those trying to help you) if you upgraded to version 7. It has many improvements, including a much better regex engine.

terrypin · Post by **terrypin** » Fri Oct 25, 2013 7:45 am

Thanks. I have 4.7.3, indeed ancient, but I'm so comfortable with it that I can't work up the enthusiasm to change. But I do appreciate the handicap that gives me when seeking help here.

Your revised version worked just fine, thank you.

I follow the last part, but could you briefly interpret that first bold section for me, please?

^([^,]+),[^(]*\(([0-9]+).*

I'm going to try adapting it to a slight change of requirement, namely to get a result which includes all the original text except the km data. Like this:

1066 Country Walk, East Sussex - Pevensey Castle to Rye, 31
Abbeys Amble, North Yorkshire, 104
Abbott's Hike, Cumbria challenging moorland walking, 107
Ainsty Bounds Walk, North Yorkshire, circular from Tadcaster, 44
etc

Do you think infrequent RE users like me would be better advised to tackle tasks like this in separate stages? In this case for example:
1. Delete the bracketed km
2. Get bracketed miles to the end
3. Remove unwanted left and right brackets and 'miles'

Much appreciate your help.

--
Terry, East Grinstead, UK

ben_josephs · Post by **ben_josephs** » Fri Oct 25, 2013 9:31 am

^([^,]+),[^(]* matches

^           the beginning of a line
(           start of captured text number 1
  [^,]+       any non-empty string within a line not containing a comma (see below)
)           end of captured text number 1
,           a comma
[^(]*       any (possibly empty) string within a line not containing a left parenthesis (see below)

where [^,]+ matches:

[^,]        any character except newline or comma
+           ... any non-zero number of times

and [^(]* matches:

[^(]        any character except newline or left parenthesis
*           ... any (possibly zero) number of times
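
The same breakdown can be tried out as a commented pattern. This is only a sketch using Python's `re.VERBOSE` mode, not TextPad syntax. One difference to be aware of: in Python `[^,]` and `[^(]` also match a newline, so the sketch works on a single line at a time. The variable names are illustrative.

```python
import re

# The first part of the expression, laid out piece by piece.
prefix = re.compile(
    r"""
    ^           # the beginning of the line
    (           # start of captured text number 1
      [^,]+     #   one or more characters that are not a comma
    )           # end of captured text number 1
    ,           # a comma
    [^(]*       # zero or more characters that are not a left parenthesis
    """,
    re.VERBOSE,
)

line = "Ainsty Bounds Walk, North Yorkshire, circular from Tadcaster, 71 km (44 miles)"
match = prefix.match(line)

name = match.group(1)          # the text captured before the first comma
consumed = match.group(0)      # everything this part of the pattern matched
remainder = line[match.end():] # what is left for the rest of the expression

print(repr(name))
print(repr(consumed))
print(repr(remainder))
```

The remainder starts at the first left parenthesis, which is where `\(([0-9]+)` takes over to capture the mileage.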

For your new requirement try

Find what: ^(.*[^,]),? [0-9]+ km \(([0-9]+) miles\)(.*)
Replace with: \1\3, \2

But this doesn't properly handle the arbitrary presence or absence of commas in the way you indicate.
This is easier with TextPad 7's regex engine.
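
The comma problem is easy to see by running the single-pass expression over the sample lines. The sketch below does this in Python, assuming the literal brackets around the miles are written as `\(` and `\)`, and assuming Python's `re` behaves closely enough to TextPad for this pattern.

```python
import re

# Single-pass version: keep everything except the km figure, move the miles to the end.
single_pass = re.compile(r"^(.*[^,]),? [0-9]+ km \(([0-9]+) miles\)(.*)")

walks = [
    "1066 Country Walk, East Sussex - 50 km (31 miles) Pevensey Castle to Rye",
    "Abbeys Amble, North Yorkshire, 167 km (104 miles)",
    "Abbott's Hike, 172 km (107 miles) Cumbria challenging moorland walking",
    "Ainsty Bounds Walk, North Yorkshire, circular from Tadcaster, 71 km (44 miles)",
]

rewritten = [single_pass.sub(r"\1\3, \2", walk) for walk in walks]

for row in rewritten:
    print(row)
```

Three of the four lines come out as requested, but the third loses the comma after the name and becomes `Abbott's Hike Cumbria challenging moorland walking, 107`.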

Yes, you might well find it easier to tackle such tasks in stages. With TextPad's old regex engine you often have no choice.
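
As an illustration of the staged approach, here is a hedged sketch of the three steps as three separate replacements. It is written in Python rather than TextPad, the function name `reorder_in_stages` is hypothetical, and stage 2 uses a non-greedy `.*?`, which is an assumption about the engine: Python supports it, but an older regex engine may not.

```python
import re

def reorder_in_stages(line: str) -> str:
    # Stage 1: delete the km figure, e.g. " 50 km".
    line = re.sub(r" [0-9]+ km", "", line, count=1)
    # Stage 2: move the bracketed miles, e.g. "(31 miles)", to the end of the line.
    line = re.sub(r"^(.*?) ?(\([0-9]+ miles\))(.*)$", r"\1\3 \2", line)
    # Stage 3: replace the trailing " (31 miles)" with ", 31",
    # absorbing a comma that may already be there.
    line = re.sub(r",? \(([0-9]+) miles\)$", r", \1", line)
    return line

walks = [
    "1066 Country Walk, East Sussex - 50 km (31 miles) Pevensey Castle to Rye",
    "Abbeys Amble, North Yorkshire, 167 km (104 miles)",
    "Abbott's Hike, 172 km (107 miles) Cumbria challenging moorland walking",
    "Ainsty Bounds Walk, North Yorkshire, circular from Tadcaster, 71 km (44 miles)",
]

staged = [reorder_in_stages(walk) for walk in walks]

for row in staged:
    print(row)
```

Because each stage does one small job, the comma after the name is never touched, so the third line keeps it on these four samples.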

terrypin · Post by **terrypin** » Fri Oct 25, 2013 9:39 am

Many thanks, very helpful!

--
Terry, East Grinstead, UK