
Strip HTML tag

Posted: Tue Apr 04, 2006 12:37 am
by insolitude
Anyone have any leads on creating a macro to strip specific HTML tags? I've got a library of HTML code, and I need to strip out all SPAN and DIV tags. Most are formatted like this:

<span class=body>text is here</span>

<div class="body">text is here</div>

Any help would be much appreciated.

Posted: Tue Apr 04, 2006 6:57 am
by ben_josephs
Use Search | Replace...

Just the tags?
Find what: </?(span|div)[^>]*>
Replace with: [Nothing]

[X] Regular expression

Or whole elements?
Find what: <span[^>]*>[^<]*</span>|<div[^>]*>[^<]*</div>
Replace with: [Nothing]

[X] Regular expression

These will only work if each tag or element is entirely on one line.

They also assume you are using POSIX regular expression syntax:
Configuration | Preferences | Editor

[X] Use POSIX regular expression syntax
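
If you ever need to do the same thing outside TextPad, the two searches above can be sketched in Python roughly as follows. This is only an illustration, not part of the TextPad answer: the function names strip_tags and strip_elements and the sample string are made up here, and the re.IGNORECASE flag is an assumption added so that upper-case SPAN and DIV tags are caught too.

```python
import re

# "Just the tags?": any opening or closing span/div tag.
# re.IGNORECASE is an assumption, so that <SPAN> and <DIV> also match.
TAG_RE = re.compile(r"</?(span|div)[^>]*>", re.IGNORECASE)

# "Or whole elements?": a span or div whose content contains no other tag.
ELEMENT_RE = re.compile(
    r"<span[^>]*>[^<]*</span>|<div[^>]*>[^<]*</div>",
    re.IGNORECASE,
)


def strip_tags(html: str) -> str:
    """Remove span/div tags but keep the text between them."""
    return TAG_RE.sub("", html)


def strip_elements(html: str) -> str:
    """Remove span/div elements together with their text content."""
    return ELEMENT_RE.sub("", html)


# Hypothetical sample built from the two lines in the question.
sample = (
    '<p>keep</p>'
    '<span class=body>text is here</span>'
    '<div class="body">text is here</div>'
)

print(strip_tags(sample))
print(strip_elements(sample))
```

One difference from TextPad: in Python a negated class such as [^>] also matches a newline, so a tag that is broken across two lines is still matched here.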

Posted: Fri Apr 07, 2006 10:17 am
by daveokeeffe
ben josephs regularly expresses the sexiness of regular expressions. I think everything I know about TP's RE capabilities, I've learnt from your many RE posts. Thanks Ben.

Posted: Fri Apr 07, 2006 11:28 am
by ben_josephs
It's kind of you to say so! I'm glad my posts are useful.

The bible for these things is

Friedl, Jeffrey E F
Mastering Regular Expressions, 2nd ed
O'Reilly, 2002
ISBN: 0596002890
http://regex.info/

In this book you will find that there is much that can be done with modern extended regular expression recognisers that can't be done with the rather weak recogniser that TextPad uses. WildEdit (http://www.textpad.com/products/wildedit/) uses a far more powerful one (Boost).

Posted: Fri Apr 07, 2006 11:36 am
by ben_josephs
ben_josephs wrote:
Find what: <span[^>]*>[^<]*</span>|<div[^>]*>[^<]*</div>
Replace with: [Nothing]

[X] Regular expression

Here's a neater way to do that:
Find what: <(span|div)[^>]*>[^<]*</\1>
Replace with: [Nothing]

[X] Regular expression
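
For anyone trying the back-reference form outside TextPad, here is a minimal Python sketch of it. The name strip_elements and the test strings are invented for the example. The \1 refers back to whatever the first group (span or div) matched, so the closing tag must have the same name as the opening one.

```python
import re

# \1 must repeat whatever group 1 matched, so <span> pairs only with </span>
# and <div> only with </div>.
ELEMENT_RE = re.compile(r"<(span|div)[^>]*>[^<]*</\1>")


def strip_elements(html: str) -> str:
    """Remove span/div elements whose content contains no other tag."""
    return ELEMENT_RE.sub("", html)


print(strip_elements('<div class="body">text is here</div>!'))
print(strip_elements("<span>a</div>"))
print(strip_elements("<div><span>x</span></div>"))
```

As with the longer pattern, [^<]* stops at the first nested tag, so in the last call only the inner span is removed by a single pass; running the replacement again would then remove the emptied div.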