Strip HTML tag

insolitude · Post by **insolitude** » Tue Apr 04, 2006 12:37 am

Anyone have any leads on creating a macro to strip specific HTML tags? I've got a library of HTML code, and I need to strip out all SPAN and DIV tags. Most are formatted like this:

<span class=body>text is here</span>

<div class="body">text is here</div>

Any help would be much appreciated.

ben_josephs · Post by **ben_josephs** » Tue Apr 04, 2006 6:57 am

Use Search | Replace...

Just the tags?

Find what: </?(span|div)[^>]*>
Replace with: [Nothing]

[X] Regular expression
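
For anyone who would rather script this over a whole library of files than run it in the editor, here is a minimal sketch of the same tags-only replacement using Python's `re` module. The function name `strip_tags` and the sample string are made up for illustration, and the case-insensitive flag is an assumption added so that upper-case SPAN and DIV tags are caught as well.

```python
import re

# Same pattern as the "Find what" field above: an optional slash,
# then span or div, then everything up to the closing angle bracket.
# re.IGNORECASE is an assumption, added so <SPAN> and <DIV> match too.
TAG_RE = re.compile(r"</?(span|div)[^>]*>", re.IGNORECASE)


def strip_tags(html: str) -> str:
    """Remove span and div tags but keep the text between them."""
    return TAG_RE.sub("", html)


sample = '<span class=body>text is here</span>\n<div class="body">text is here</div>'
print(strip_tags(sample))
```

Like the editor version, this only looks at the tag name prefix, so a tag such as `<divider>` would also be removed.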

Or whole elements?

Find what: <span[^>]*>[^<]*</span>|<div[^>]*>[^<]*</div>
Replace with: [Nothing]

[X] Regular expression

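These will only work if each tag or element sits entirely on one line. The whole-element expression also skips any element that contains another tag, because [^<]* stops at the first <.

This assumes you are using POSIX regular expression syntax: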
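
As a rough illustration of the whole-element version outside the editor, the sketch below applies the same expression with Python's `re` module. The name `strip_elements` and the test strings are hypothetical. Note that in Python a negated class such as `[^<]*` also matches newlines, so the one-line restriction is a property of the editor's search, not of this sketch.

```python
import re

# Same expression as the "Find what" field for whole elements.
ELEMENT_RE = re.compile(r"<span[^>]*>[^<]*</span>|<div[^>]*>[^<]*</div>")


def strip_elements(html: str) -> str:
    """Remove span and div elements whose content holds no other tag."""
    return ELEMENT_RE.sub("", html)


# A simple element is removed along with its text.
print(strip_elements("keep <span class=body>drop</span> this"))

# With nesting, only the innermost element matches on the first pass,
# because [^<]* cannot cross the inner tag.
nested = "<div><span>inner</span> tail</div>"
once = strip_elements(nested)
print(once)                  # <div> tail</div>
print(strip_elements(once))  # empty string after a second pass
```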

Configuration | Preferences | Editor

[X] Use POSIX regular expression syntax

daveokeeffe · Post by **daveokeeffe** » Fri Apr 07, 2006 10:17 am

ben josephs regularly expresses the sexiness of regular expressions. I think everything I know about TP's RE capabilities, I've learnt from your many RE posts. Thanks Ben.

ben_josephs · Post by **ben_josephs** » Fri Apr 07, 2006 11:28 am

It's kind of you to say so! I'm glad my posts are useful.

The bible for these things is

Friedl, Jeffrey E F
Mastering Regular Expressions, 2nd ed
O'Reilly, 2002
ISBN: 0596002890
http://regex.info/

In this book you will find that there is much that can be done with modern extended regular expression recognisers that can't be done with the rather weak recogniser that TextPad uses. WildEdit (http://www.textpad.com/products/wildedit/) uses a far more powerful one (Boost).

ben_josephs · Post by **ben_josephs** » Fri Apr 07, 2006 11:36 am

ben_josephs wrote:
Find what: <span[^>]*>[^<]*</span>|<div[^>]*>[^<]*</div>
Replace with: [Nothing]

[X] Regular expression

Here's a neater way to do that:

Find what: <(span|div)[^>]*>[^<]*</\1>
Replace with: [Nothing]

[X] Regular expression
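
To see what the back-reference buys you, here is a small sketch of the same expression in Python's `re` module (the name `strip_paired` and the inputs are made up for illustration). The `\1` has to repeat whatever the first group matched, so a closing tag only counts if it has the same name as the opening tag.

```python
import re

# \1 refers back to whichever of span or div the first group matched.
PAIRED_RE = re.compile(r"<(span|div)[^>]*>[^<]*</\1>")


def strip_paired(html: str) -> str:
    """Remove span and div elements whose open and close tags agree."""
    return PAIRED_RE.sub("", html)


print(strip_paired('<span class=body>a</span><div class="body">b</div>c'))  # c
print(strip_paired("<span>a</div>"))  # unchanged: tag names differ
```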