Strip HTML tag

insolitude
Posts: 8
Joined: Tue Apr 04, 2006 12:33 am

Post by insolitude »

Anyone have any leads on creating a macro to strip specific HTML tags? I've got a library of HTML code, and I need to strip out all SPAN and DIV tags. Most are formatted like this:

<span class=body>text is here</span>

<div class="body">text is here</div>

Any help would be much appreciated.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Use Search | Replace...

Just the tags?
Find what: </?(span|div)[^>]*>
Replace with: [Nothing]

[X] Regular expression

Or whole elements?
Find what: <span[^>]*>[^<]*</span>|<div[^>]*>[^<]*</div>
Replace with: [Nothing]

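[X] Regular expression

These will only work if each tag (or, for the second pattern, each whole element) is entirely on one line. The whole-element pattern also assumes the element contains no nested tags, because [^<]* stops at the first <.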

This assumes you are using POSIX regular expression syntax:
Configuration | Preferences | Editor

[X] Use POSIX regular expression syntax
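
For anyone who wants to try the two patterns above outside TextPad, here is a minimal sketch in Python. It assumes Python's re module, which accepts both expressions unchanged; the function names and the sample line are made up for illustration and are not part of TextPad.

```python
import re

# The two patterns from the post above, used unchanged.
TAGS_ONLY = re.compile(r"</?(span|div)[^>]*>")
WHOLE_ELEMENTS = re.compile(r"<span[^>]*>[^<]*</span>|<div[^>]*>[^<]*</div>")


def strip_tags(line):
    """Remove the SPAN/DIV tags but keep the text between them."""
    return TAGS_ONLY.sub("", line)


def strip_elements(line):
    """Remove each SPAN/DIV element together with its text content."""
    return WHOLE_ELEMENTS.sub("", line)


# Hypothetical sample line, modelled on the examples in the question.
sample = '<span class=body>text is here</span> and <div class="body">more text</div>'

print(repr(strip_tags(sample)))      # 'text is here and more text'
print(repr(strip_elements(sample)))  # ' and '
```

The first function corresponds to "Just the tags?" and the second to "Or whole elements?". Like the original, the tag pattern is loose: it would also match any other tag whose name begins with span or div.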
daveokeeffe
Posts: 10
Joined: Tue May 17, 2005 1:53 pm

Post by daveokeeffe »

ben josephs regularly expresses the sexiness of regular expressions. I think everything I know about TP's RE capabilities, I've learnt from your many RE posts. Thanks Ben.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

It's kind of you to say so! I'm glad my posts are useful.

The bible for these things is

Friedl, Jeffrey E F
Mastering Regular Expressions, 2nd ed
O'Reilly, 2002
ISBN: 0596002890
http://regex.info/

In this book you will find that there is much that can be done with modern extended regular expression recognisers that can't be done with the rather weak recogniser that TextPad uses. WildEdit (http://www.textpad.com/products/wildedit/) uses a far more powerful one (Boost).
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

ben_josephs wrote:
Find what: <span[^>]*>[^<]*</span>|<div[^>]*>[^<]*</div>
Replace with: [Nothing]

[X] Regular expression

Here's a neater way to do that:
Find what: <(span|div)[^>]*>[^<]*</\1>
Replace with: [Nothing]

[X] Regular expression
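
The back-reference version can be tried the same way. This is only a sketch, assuming Python's re module, where \1 likewise refers to whatever the first group matched; the function name is hypothetical.

```python
import re

# \1 must repeat the tag name captured by (span|div), so an opening
# <span> is only paired with </span>, and <div> only with </div>.
ELEMENT = re.compile(r"<(span|div)[^>]*>[^<]*</\1>")


def strip_elements(line):
    """Remove each SPAN/DIV element together with its text content."""
    return ELEMENT.sub("", line)


print(repr(strip_elements('<div class="body">text is here</div> kept')))  # ' kept'
print(repr(strip_elements('<span>x</div>')))  # '<span>x</div>' (mismatched, left alone)
```

Because of the back-reference, a mismatched pair such as <span>...</div> is not removed, which is the same behaviour as the longer alternation it replaces.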