Extracting data, including some bracketed

terrypin
Posts: 174
Joined: Wed Jul 11, 2007 7:50 am

Post by terrypin »

My source looks like this (a list of long distance UK walks):

1066 Country Walk, East Sussex - 50 km (31 miles) Pevensey Castle to Rye
Abbeys Amble, North Yorkshire, 167 km (104 miles)
Abbott's Hike, 172 km (107 miles) Cumbria challenging moorland walking
Ainsty Bounds Walk, North Yorkshire, circular from Tadcaster, 71 km (44 miles)
Angles Way, 123 km (76 miles) from Great Yarmouth to Knettishall Heath, with much of the path following the Norfolk/Suffolk border. Additionally there is a link path from Knettishall Heath to Thetford
Avon Valley Path, 54 km (34 miles) Christchurch to Salisbury (Hampshire and Wiltshire)
Basingstoke Canal, 53 km (33 miles)
Bishop Bennet Way, 55 km (34 miles) Beeston to Wirswall (Cheshire, Staffordshire)
etc

I want a list like this:

1066 Country Walk, 31
Abbeys Amble, 104
Abbott's Hike, 107
Ainsty Bounds Walk, 44
etc

(So that I can import it into a spreadsheet and sort by distance.)

IOW I want the name before the first comma followed by a comma, a space and the mileage taken from inside the first pair of brackets.


I'm not clear why this doesn't work:

Find: (.*), (.*) \((.*) (.*)
Replace with: \1, \3


--
Terry, East Grinstead, UK
ben_josephs
Posts: 2464
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

If you're tempted to use .* check whether it's what you really mean.
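To illustrate the point about .*, here is a small sketch that runs the original expression through Python's re module. This is an assumption-laden stand-in: Python's engine is not TextPad's, so the details may differ, but the greedy behaviour of .* is similar. The names pattern, lines and show_groups are made up for this example.

```python
import re

# Terry's original expression, tried in Python's re module (not TextPad's engine).
pattern = re.compile(r"(.*), (.*) \((.*) (.*)")

lines = [
    "1066 Country Walk, East Sussex - 50 km (31 miles) Pevensey Castle to Rye",
    "Abbeys Amble, North Yorkshire, 167 km (104 miles)",
]

def show_groups(line):
    """Return the four captured groups, or None if the line does not match."""
    m = pattern.match(line)
    return m.groups() if m else None

for line in lines:
    # Each .* grabs as much as it can, so the groups are wider than intended.
    print(show_groups(line))
    print(pattern.sub(r"\1, \3", line))
```

In the first line the third group swallows everything up to the last space, and in the second line the first group runs on to the last comma, so neither replacement gives the wanted "name, miles" pair.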

Try
Find what: ^([^,]+),[^(]*\((\d+).*
Replace with: $1, $2
Edit: Changed \ to new-style $ in replacement expression.
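For anyone who wants to check the expression outside TextPad, the following is a rough sketch in Python, assuming its re module treats this particular pattern the same way. The names walk, source and extract are invented for the example, and the sort at the end stands in for the spreadsheet step.

```python
import re

# Same expression as above: name up to the first comma, then the first
# number that follows a left parenthesis.
walk = re.compile(r"^([^,]+),[^(]*\((\d+).*")

source = [
    "1066 Country Walk, East Sussex - 50 km (31 miles) Pevensey Castle to Rye",
    "Abbeys Amble, North Yorkshire, 167 km (104 miles)",
    "Abbott's Hike, 172 km (107 miles) Cumbria challenging moorland walking",
    "Ainsty Bounds Walk, North Yorkshire, circular from Tadcaster, 71 km (44 miles)",
    "Basingstoke Canal, 53 km (33 miles)",
]

def extract(lines):
    """Return (name, miles) pairs for every line that matches."""
    rows = []
    for line in lines:
        m = walk.match(line)
        if m:
            rows.append((m.group(1), int(m.group(2))))
    return rows

# Sort by distance, shortest first, and print in the "name, miles" layout.
for name, miles in sorted(extract(source), key=lambda row: row[1]):
    print(f"{name}, {miles}")
```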
Last edited by ben_josephs on Thu Oct 24, 2013 9:13 pm, edited 1 time in total.
terrypin
Posts: 174
Joined: Wed Jul 11, 2007 7:50 am

Post by terrypin »

Thanks, but here that's telling me it cannot find the RE

^([^,]+),[^(]*\((\d+).*

I agree, I'm expecting too much from .* and will have to study the rules carefully again!

--
Terry, East Grinstead, UK
ben_josephs
Posts: 2464
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Are you still using version 4.7.2? Try
Find what: ^([^,]+),[^(]*\(([0-9]+).*
Replace with: \1, \2
Version 4.7.2 is over 10 years old. It might be to your advantage (and to that of those trying to help you) to upgrade to version 7. It has many improvements, including a much better regex engine.
terrypin
Posts: 174
Joined: Wed Jul 11, 2007 7:50 am

Post by terrypin »

Thanks. I have 4.7.3, indeed ancient, but I'm so comfortable with it that I can't work up the enthusiasm to change. But I do appreciate the handicap that gives me when seeking help here.

Your revised version worked just fine, thank you.

I follow the last part, but could you briefly interpret the first part, up to the \(, for me please?

^([^,]+),[^(]*\(([0-9]+).*

I'm going to try adapting it to a slight change of requirement, namely to get a result which includes all the original text except the km data. Like this:

1066 Country Walk, East Sussex - Pevensey Castle to Rye, 31
Abbeys Amble, North Yorkshire, 104
Abbott's Hike, Cumbria challenging moorland walking, 107
Ainsty Bounds Walk, North Yorkshire, circular from Tadcaster, 44
etc

Do you think infrequent RE users like me would be better advised to tackle tasks like this in separate stages? In this case for example:
1. Delete the bracketed km
2. Get bracketed miles to the end
3. Remove unwanted left and right brackets and 'miles'
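
The three stages could be sketched roughly as below. This is only an illustration in Python's re module, not in TextPad, and the three expressions and the name in_stages are assumptions made up for the sketch rather than anything tested in TextPad 4.7.

```python
import re

def in_stages(line):
    """Apply the three stages one after another to a single line."""
    # Stage 1: delete the km figure, e.g. " 167 km".
    line = re.sub(r" [0-9]+ km", "", line)
    # Stage 2: move the bracketed miles, e.g. "(104 miles)", to the end.
    line = re.sub(r"^(.*) (\([0-9]+ miles\))(.*)$", r"\1\3 \2", line)
    # Stage 3: strip the brackets and 'miles', leaving ", 104".
    line = re.sub(r",? \(([0-9]+) miles\)$", r", \1", line)
    return line

samples = [
    "1066 Country Walk, East Sussex - 50 km (31 miles) Pevensey Castle to Rye",
    "Abbeys Amble, North Yorkshire, 167 km (104 miles)",
    "Abbott's Hike, 172 km (107 miles) Cumbria challenging moorland walking",
    "Ainsty Bounds Walk, North Yorkshire, circular from Tadcaster, 71 km (44 miles)",
]

for sample in samples:
    print(in_stages(sample))
```

Each stage is simple enough to check by eye, which is the main attraction of working this way.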

Much appreciate your help.

--
Terry, East Grinstead, UK
ben_josephs
Posts: 2464
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

^([^,]+),[^(]* matches

^           the beginning of a line
(           start of captured text number 1
  [^,]+       any non-empty string within a line not containing a comma (see below)
)           end of captured text number 1
,           a comma
[^(]*       any (possibly empty) string within a line not containing a left parenthesis (see below)

where [^,]+ matches:

[^,]        any character except newline or comma
+           ... any non-zero number of times

and [^(]* matches:

[^(]        any character except newline or left parenthesis
*           ... any (possibly zero) number of times
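
The same breakdown can be written down as a commented expression. The sketch below uses Python's re.VERBOSE flag, which is a Python feature and not something TextPad offers; the name walk is made up. One difference to keep in mind: in Python, [^,] and [^(] would also match a newline, so the sketch assumes it is applied one line at a time.

```python
import re

# The whole find expression, one piece per line, with a comment on each piece.
walk = re.compile(r"""
    ^               # the beginning of a line
    (               # start of captured text number 1
      [^,]+         #   one or more characters that are not a comma
    )               # end of captured text number 1
    ,               # a comma
    [^(]*           # zero or more characters that are not a left parenthesis
    \(              # a literal left parenthesis
    (               # start of captured text number 2
      [0-9]+        #   one or more digits (the mileage)
    )               # end of captured text number 2
    .*              # the rest of the line
""", re.VERBOSE)

# Replace the whole line with capture 1, a comma, a space and capture 2.
print(walk.sub(r"\1, \2", "Basingstoke Canal, 53 km (33 miles)"))
```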
For your new requirement try
Find what: ^(.*[^,]),? [0-9]+ km \(([0-9]+) miles\)(.*)
Replace with: \1\3, \2
But this doesn't properly handle the inconsistent use of commas in your source in the way you indicate.
This is easier with TextPad 7's regex engine.
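
To make the comma caveat concrete, here is a rough check of that expression in Python's re module (again an assumption: it is not TextPad's engine, though this pattern uses nothing engine-specific). The names keep_text and rewrite are invented for the sketch.

```python
import re

# The find expression for the new requirement, unchanged.
keep_text = re.compile(r"^(.*[^,]),? [0-9]+ km \(([0-9]+) miles\)(.*)")

def rewrite(line):
    """Drop the km figure and move the mileage to the end of the line."""
    return keep_text.sub(r"\1\3, \2", line)

# The comma before the km figure is the only separator here, and it is dropped.
print(rewrite("Abbott's Hike, 172 km (107 miles) Cumbria challenging moorland walking"))

# Here nothing follows the distances, so the result comes out as wanted.
print(rewrite("Abbeys Amble, North Yorkshire, 167 km (104 miles)"))
```

The first call loses the comma after "Abbott's Hike", which is the case the expression does not cover.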

Yes, you might well find it easier to tackle such tasks in stages. With TextPad's old regex engine you often have no choice.
terrypin
Posts: 174
Joined: Wed Jul 11, 2007 7:50 am

Post by terrypin »

Many thanks, very helpful!

--
Terry, East Grinstead, UK