Stripping HTML / XML tags AND content

kessa
Posts: 13
Joined: Tue Mar 13, 2007 12:18 pm


Post by kessa »

Hi All,

I'm new to this forum, and so first of all a quick apology if this question has already been asked / answered in this forum (I had a quick search and couldn't find the same scenario so I'll assume it's safe to continue :D )

I've been provided with an XML file from a third party which is absolutely massive - over 80 MB and well over 1,700,000 lines (yes, that's right - over 1.7 million lines!)

To help speed up any work which I need to do on the file I wanted to strip out any tags (and associated content) which I don't need.

I know how to do this via Dreamweaver, but due to the massive file size Dreamweaver struggles to render it (and then struggles even more when I try to run a find/replace to delete the content)

Textpad seems to be massively quicker and so I wanted to know how to do the same thing in Textpad (version 4.7)

Example:


<tag1>some content</tag1>
<tag2>some more content</tag2>
<tag3>and some more....</tag3>
If I wanted to strip out <tag2> plus all of its content (so: "<tag2>some more content</tag2>"), how would I go about doing this in Textpad?

Note:
I've seen in some of the previous posts that there seems to be an issue if content is on more than one line - as a result, I need something which will work regardless of the number of lines which the code runs over.

Thanks
Kessa
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

See http://www.textpad.info/forum/viewtopic.php?t=7231, towards the end of the thread.

You can't do this easily in TextPad, as there's no way in TextPad to match text containing an arbitrary number of newlines.

You might try WildEdit (http://www.textpad.com/products/wildedit/). Assuming that a <tag1> element doesn't contain any inner <tag1> elements, etc., this might do what you need:
Find what: <tag1>.*?</tag1>
Replace with: [nothing]

[X] Regular expression

Options
[ ] '.' does not match a newline character [i.e., not selected]
Or, more generally:
Find what: <(tag1|tag2|tag3)>.*?</\1>
Replace with: [nothing]
You'll have to buy a licence for WildEdit to use it for files of the size you've indicated. A script might be a better way to go.
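
For anyone going the script route, here is a minimal Python sketch of the same idea. It is only an illustration, not something from WildEdit or TextPad: the function name strip_elements and the sample text are made up, and it carries the same assumption as the regular expression above (no element contains an inner element of the same name).

```python
import re

def strip_elements(text, tag_names):
    """Remove each <tag>...</tag> element (tags and content) for the given tag names."""
    names = "|".join(re.escape(name) for name in tag_names)
    # re.DOTALL makes '.' match newlines too, so elements spanning several lines are matched.
    # '.*?' is non-greedy, and '\1' requires the closing tag to carry the same name as the opening tag.
    pattern = re.compile(r"<(" + names + r")>.*?</\1>", re.DOTALL)
    return pattern.sub("", text)

# Hypothetical sample; the <tag2> element deliberately spans two lines.
sample = (
    "<tag1>some content</tag1>\n"
    "<tag2>some more\ncontent</tag2>\n"
    "<tag3>and some more....</tag3>\n"
)
result = strip_elements(sample, ["tag2"])
print(result)
```

Only the element itself is removed, so the line break that followed it is left behind as an empty line. For a real file you would read the whole text in, call strip_elements with the list of tags to drop, and write the result to a new file.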
kessa
Posts: 13
Joined: Tue Mar 13, 2007 12:18 pm

Post by kessa »

Hi ben_josephs,

Thanks for this, it's really helpful.

I'll check out wildedit and then give this a shot.

Cheers :D
Kessa
MudGuard
Posts: 1295
Joined: Sun Mar 02, 2003 10:15 pm
Location: Munich, Germany

Post by MudGuard »

Just a warning:
<(tag1|tag2|tag3)>.*?</\1>
will fail if these elements are contained in themselves, like

<tag1>111<tag1>222</tag1>333</tag1>

Then only <tag1>111<tag1>222</tag1> will be matched, but not 333</tag1>

(Regular expressions are no good for nested stuff as nesting levels can't be counted)
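
To make the nesting problem concrete, here is a small Python sketch. The first part reproduces the failure with the non-greedy pattern; the second is one possible workaround in a script, keeping a depth counter while walking the tags. The function name strip_nested is made up for illustration, and it assumes the tags carry no attributes, as in the examples in this thread.

```python
import re

nested = "a<tag1>111<tag1>222</tag1>333</tag1>b"

# The non-greedy pattern stops at the first closing tag, leaving the tail behind.
regex_result = re.sub(r"<(tag1)>.*?</\1>", "", nested, flags=re.DOTALL)

def strip_nested(text, tag):
    """Remove <tag>...</tag> elements, including nested elements of the same name."""
    token = re.compile(r"<(/?)" + re.escape(tag) + r">")
    kept = []
    depth = 0   # how many <tag> elements are currently open
    pos = 0     # end of the last tag seen
    for match in token.finditer(text):
        if depth == 0:
            kept.append(text[pos:match.start()])  # text outside any element is kept
        if match.group(1):                        # closing tag
            depth = max(depth - 1, 0)
        else:                                     # opening tag
            depth += 1
        pos = match.end()
    if depth == 0:
        kept.append(text[pos:])  # trailing text; dropped if an element was never closed
    return "".join(kept)

counted_result = strip_nested(nested, "tag1")
print(regex_result)
print(counted_result)
```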
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

My suggestion contains that warning.
kessa
Posts: 13
Joined: Tue Mar 13, 2007 12:18 pm

Post by kessa »

Hi ben_josephs,

Just wanted to say that I tried your suggestion of:
Find what: <(tag1|tag2|tag3)>.*?</\1>
Replace with: [nothing]
... worked a treat - in fact, it worked even better than I had hoped (I expected I would have to declare each tag individually - I didn't realise that I could list them all and just run the job once - so a massive thank you for saving me hours of hard labour!!! :D )

I've now got another question :wink: which I hope you can help me with?

The XML feed I am working with comes from a company in Europe and as a result, there are quite a few spelling mistakes / errors in the English translation.

Ideally I'd like to avoid having to manually do a spelling check each time I process the feed as it contains well over 1.7 million lines!

As a result, I wondered if it was possible for me to specify a list of words to look out for, and what these should be replaced with?

- I can then run this using wildedit and hopefully knock a lot of the errors on the head really quickly!

So for example, how would I go about performing the following replacement:

fishng = fishing
hellllooo = hello
gooooogle = google

(The above should NOT be case sensitive)

Also, I might need to do a case sensitive find replace such as:

dvd = DVD
england = England

How would I go about doing this? (I'm happy to run that as a separate find / replace job)

Also, will I need to specify a space before / after each of the above? (I don't know / don't want wildedit to use stem searching)

Thanks
Kessa
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

In WildEdit:
Find what: \<(?:(fishng)|(hellllooo)|(gooooogle))\>
Replace with: ?1(fishing):?2(hello):?3(google)

[X] Regular expression
[X] Replacement format
[ ] Match case
In WildEdit's help, search for the phrases
"start word" for the regular expressions \< and \>,
"non-marking parentheses" for the regular expression construct (?:...), and
"conditional expression" for the replacement expression construct ?n...:...

Surely the searches for dvd, england, etc., are case insensitive, too.

If you have a lot of these misspellings, you may want to consider doing this with a script instead.
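
As a rough idea of what such a script could look like, here is a minimal Python sketch. It is an assumption about how one might do it, not WildEdit behaviour: the names CORRECTIONS and fix_spelling are made up, and the word list is just the examples from this thread. It matches whole words only and ignores case when searching.

```python
import re

# Misspelling (lower case) -> replacement. Extend this table as needed.
CORRECTIONS = {
    "fishng": "fishing",
    "hellllooo": "hello",
    "gooooogle": "google",
    "dvd": "DVD",
    "england": "England",
}

# \b marks a word boundary, so "fishng" is not matched inside a longer word.
WORDS = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in CORRECTIONS) + r")\b",
    re.IGNORECASE,
)

def fix_spelling(text):
    """Replace every listed misspelling with its correction, whatever its case."""
    return WORDS.sub(lambda match: CORRECTIONS[match.group(0).lower()], text)

fixed = fix_spelling("Fishng in england, gooooogle a dvd; fishngs")
print(fixed)
```

The replacement is always written exactly as it appears in the table, so "Fishng" at the start of a sentence becomes "fishing"; preserving the original capitalisation would need extra handling.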
kessa
Posts: 13
Joined: Tue Mar 13, 2007 12:18 pm

Post by kessa »

fab - thanks :D
kessa
Posts: 13
Joined: Tue Mar 13, 2007 12:18 pm

Post by kessa »

Hi ben_josephs,

I wonder if you may be able to help me with something, just following on from the posts above?

I've tried using the code suggested, and for the most part it seems to work fine.

However, for a couple of the things I am trying to replace I am getting some weird results.

For example, if I do a search for "livingroom" using textpad, I get 3,709 results in my document.

However, if I do a find / replace in Wildedit, it seems to return 3,748 results? (an additional 39 results)

In Textpad, I'm just using a bog standard find (i.e. not case sensitive and not using regular expressions)

In Wildedit, I'm using the code above, and am trying to replace "livingroom" with "living room"

Any ideas what could be causing the inconsistencies?

Thanks
Kessa
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Please provide examples of what is found in WildEdit but not in TextPad.
kessa
Posts: 13
Joined: Tue Mar 13, 2007 12:18 pm

Post by kessa »

Hi,

How do I find this out as when I run the find/replace it just tells me how many updates were made - it doesn't seem to show me where it made the changes?

Am I looking in the wrong place / do I need to do something differently?

Cheers
Kessa
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

There are many applications that will show you the differences between two files, including... TextPad. Ask it to show you the differences between what is produced by TextPad and by WildEdit.
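
If a script is more convenient, a line-by-line comparison is easy to sketch in Python. This is only an illustration with made-up sample data and a made-up function name (differing_lines); it assumes the replacements do not add or remove line breaks, so that line N in one output corresponds to line N in the other.

```python
from itertools import zip_longest

def differing_lines(lines_a, lines_b):
    """Yield (line_number, a, b) for every position where the two sequences differ."""
    pairs = zip_longest(lines_a, lines_b)  # pads the shorter input with None
    for number, (a, b) in enumerate(pairs, start=1):
        if a != b:
            yield number, a, b

# Hypothetical sample data standing in for the two output files.
lines_a = ["alpha", "beta", "gamma"]
lines_b = ["alpha", "BETA", "gamma", "delta"]
diffs = list(differing_lines(lines_a, lines_b))
print(diffs)
```

Open file objects can be passed in place of the lists, since they also yield one line at a time, which keeps memory use low for a file with over a million lines.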