
Remove numbers, letters and commas left of an XML tag

Posted: Thu Jun 28, 2012 10:51 am
by djp6
Hi

I'm new to regular expressions and have searched for hours, but I'm more confused now than before!
I have an XML file with thousands of lines, many containing examples like:-

<b203>e-Study Guide for: Western Humanities, Complete by Roy Matthews, ISBN 9780073376622</b203>

I need to remove the unique ISBN numbers, plus the word ISBN and the comma and space before it, from all of the title tags (<b203></b203>). This is what I need to remove:-

, ISBN 978??????????

leaving only, in this example:-

<b203>e-Study Guide for: Western Humanities, Complete by Roy Matthews</b203>

Each ISBN is different (as are the title and author), but I thought a regular expression such as:-
remove 20 characters (numbers, letters, commas and whitespace) to the left of </b203>
would be the easiest way, but I can’t find the regular expression for this.
Can anyone help, please?

Thanks in advance
Dave

Posted: Thu Jun 28, 2012 11:18 am
by ak47wong
First, enable POSIX regular expression syntax in Configure > Preferences > Editor.

This will delete all the ISBN numbers in the document regardless of what tag they're in:

Find what: ,_ISBN_[0-9]{13} (replace the underscores with spaces)
Replace with: [nothing]

Select Regular expression and click Replace All.
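
For anyone who wants to try the pattern outside the editor, here is a minimal sketch in Python. It assumes Python's re module behaves like the editor's POSIX syntax for this particular pattern; the names ISBN_SUFFIX and strip_isbn are made up for illustration.

```python
import re

# Same pattern as the "Find what" field, with real spaces instead of underscores:
# a comma, a space, the word ISBN, a space, then exactly 13 digits.
ISBN_SUFFIX = re.compile(r", ISBN [0-9]{13}")


def strip_isbn(text: str) -> str:
    """Delete every ', ISBN <13 digits>' occurrence, whatever tag it is in."""
    return ISBN_SUFFIX.sub("", text)


line = (
    "<b203>e-Study Guide for: Western Humanities, Complete by "
    "Roy Matthews, ISBN 9780073376622</b203>"
)
print(strip_isbn(line))
# prints: <b203>e-Study Guide for: Western Humanities, Complete by Roy Matthews</b203>
```

Because the pattern does not mention any tag, an ISBN in some other element would be removed as well.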

If you need to restrict the deletion to <b203> tags, do this:

Find what: ,_ISBN_[0-9]{13}(</b203>) (replace the underscores with spaces)
Replace with: \1
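
The \1 in the replacement puts back whatever the parentheses captured, so the closing tag survives while the ISBN text in front of it is dropped. A rough Python sketch of the same idea follows (Python's re module is standing in for the editor's engine here, and the names B203_ISBN and strip_isbn_in_b203 are hypothetical):

```python
import re

# The parentheses capture the closing tag as group 1, so a match is only
# found when the ISBN sits directly in front of </b203>.
B203_ISBN = re.compile(r", ISBN [0-9]{13}(</b203>)")


def strip_isbn_in_b203(text: str) -> str:
    """Remove ', ISBN <13 digits>' only where it directly precedes </b203>."""
    # r"\1" re-inserts the captured closing tag in place of the whole match.
    return B203_ISBN.sub(r"\1", text)


title = "<b203>Western Humanities by Roy Matthews, ISBN 9780073376622</b203>"
other = "<b036>Western Humanities by Roy Matthews, ISBN 9780073376622</b036>"

print(strip_isbn_in_b203(title))  # ISBN removed, </b203> kept
print(strip_isbn_in_b203(other))  # unchanged, because the tag is not b203
```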

Or, you can do it the way you described and delete the 20 characters before the end tag:

Find what: .{20}(</b203>)
Replace with: \1
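
One thing to keep in mind with this version: it counts characters rather than looking for the ISBN, so it also trims any <b203> line that has no ISBN at all. The Python sketch below illustrates that behaviour, assuming Python's re module treats .{20} the same way the editor does; LAST_20 and strip_last_20 are made-up names.

```python
import re

# Any 20 characters (except newline) directly in front of the closing tag.
LAST_20 = re.compile(r".{20}(</b203>)")


def strip_last_20(text: str) -> str:
    """Drop the 20 characters before </b203>, whatever they happen to be."""
    return LAST_20.sub(r"\1", text)


with_isbn = "<b203>Western Humanities by Roy Matthews, ISBN 9780073376622</b203>"
without_isbn = "<b203>Western Humanities, Complete by Roy Matthews</b203>"

# ", ISBN 9780073376622" is exactly 20 characters, so this line comes out clean.
print(strip_last_20(with_isbn))

# No ISBN here, but the last 20 characters of the title are removed anyway.
print(strip_last_20(without_isbn))
```

If some titles might lack an ISBN, the pattern that spells out ISBN and the 13 digits is the safer choice.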

Posted: Thu Jun 28, 2012 3:27 pm
by djp6
Many thanks ak47wong, that worked perfectly. I used the second option as it seemed safer, and I was interested in the \1 replacement.
Great, thanks again.