Stripping HTML / XML tags AND content

Posted: Wed Mar 14, 2007 3:15 pm
by kessa
Hi All,

I'm new to this forum, and so first of all a quick apology if this question has already been asked / answered in this forum (I had a quick search and couldn't find the same scenario so I'll assume it's safe to continue :D )

I've been provided with an XML file from a third party which is absolutely massive - over 80 MB and well over 1,700,000 lines (yes, that's right - over 1.7 million lines!)

To help speed up any work which I need to do on the file I wanted to strip out any tags (and associated content) which I don't need.

I know how to do this via Dreamweaver, but due to the massive file size Dreamweaver struggles to render it (and then struggles even more when I try to run a find/replace to delete the content)

Textpad seems to be massively quicker and so I wanted to know how to do the same thing in Textpad (version 4.7)

Example:

<tag1>some content</tag1>
<tag2>some more content</tag2>
<tag3>and some more....</tag3>
If I wanted to strip out <tag2> plus all of its content (so: "<tag2>some more content</tag2>"), how would I go about doing this in Textpad?

Note:
I've seen in some of the previous posts that there seems to be an issue if content is on more than one line - as a result, I need something which will work regardless of the number of lines which the code runs over.

Thanks
Kessa

Posted: Wed Mar 14, 2007 4:47 pm
by ben_josephs
See http://www.textpad.info/forum/viewtopic.php?t=7231, towards the end of the thread.

You can't do this easily in TextPad, as there's no way in TextPad to match text containing an arbitrary number of newlines.

You might try WildEdit (http://www.textpad.com/products/wildedit/). Assuming that a <tag1> element doesn't contain any inner <tag1> elements, etc., this might do what you need:
Find what: <tag1>.*?</tag1>
Replace with: [nothing]

[X] Regular expression

Options
[ ] '.' does not match a newline character [i.e., not selected]
Or, more generally:
Find what: <(tag1|tag2|tag3)>.*?</\1>
Replace with: [nothing]
You'll have to buy a licence for WildEdit to use it for files of the size you've indicated. A script might be a better way to go.
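
For anyone taking the script route, here is a minimal Python sketch of the same find/replace. It is an illustration, not a tested tool for the original feed: the function name strip_elements and the sample text are made up here, and it carries the same assumptions as the expression above (tags without attributes, and no element nested inside an element of the same name). The re.DOTALL flag plays the role of leaving "'.' does not match a newline character" unselected.

```python
import re

def strip_elements(text, tag_names):
    """Remove each <tag>...</tag> element (tags and content) for the given names."""
    names = "|".join(re.escape(name) for name in tag_names)
    # re.DOTALL lets '.' match newlines, so elements spanning several lines go too.
    # '.*?' is non-greedy: it stops at the first closing tag with the same name (\1).
    pattern = re.compile(r"<(" + names + r")>.*?</\1>", re.DOTALL)
    return pattern.sub("", text)

# Hypothetical sample; note that the tag2 element runs over two lines.
sample = (
    "<tag1>some content</tag1>\n"
    "<tag2>some\nmore content</tag2>\n"
    "<tag3>and some more....</tag3>"
)
print(strip_elements(sample, ["tag2"]))
```

For an 80 MB file this reads everything into memory at once, which is usually fine on a modern machine, but a streaming XML parser would be the more robust choice if the data is less regular than the example.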

Posted: Wed Mar 14, 2007 5:11 pm
by kessa
Hi ben_josephs,

Thanks for this, it's really helpful.

I'll check out wildedit and then give this a shot.

Cheers :D
Kessa

Posted: Wed Mar 14, 2007 7:59 pm
by MudGuard
Just a warning:
<(tag1|tag2|tag3)>.*?</\1>
will fail if these elements are contained in themselves, like

<tag1>111<tag1>222</tag1>333</tag1>

Then only <tag1>111<tag1>222</tag1> will be matched, but not 333</tag1>

(Regular expressions are no good for nested stuff as nesting levels can't be counted)
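
The failure described above is easy to reproduce in a script, and a script can also do the counting that the regular expression can't. The following Python sketch is only an illustration under simple assumptions (tags without attributes; the function name strip_nested is invented here): it walks the opening and closing tags in order and tracks the nesting depth.

```python
import re

nested = "<tag1>111<tag1>222</tag1>333</tag1>"

# The non-greedy pattern stops at the first </tag1>, leaving the tail behind.
leftover = re.sub(r"<(tag1)>.*?</\1>", "", nested, flags=re.DOTALL)
print(leftover)

def strip_nested(text, tag):
    """Remove <tag>...</tag> elements, counting depth so nested ones are removed whole."""
    token = re.compile(r"<(/?)" + re.escape(tag) + r">")
    kept = []
    depth = 0       # how many <tag> elements are currently open
    pos = 0         # start of the next stretch of text to keep
    open_start = 0  # where the outermost open element began
    for m in token.finditer(text):
        if m.group(1) == "":            # opening tag
            if depth == 0:
                kept.append(text[pos:m.start()])
                open_start = m.start()
            depth += 1
        elif depth > 0:                 # closing tag matching an open element
            depth -= 1
            if depth == 0:
                pos = m.end()
        # a stray closing tag at depth 0 is left in the text untouched
    if depth > 0:
        pos = open_start                # unbalanced input: keep the unclosed remainder
    kept.append(text[pos:])
    return "".join(kept)

print(strip_nested("a" + nested + "b", "tag1"))
```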

Posted: Wed Mar 14, 2007 8:39 pm
by ben_josephs
My suggestion contains that warning.

Posted: Tue Mar 20, 2007 11:25 pm
by kessa
Hi ben_josephs,

Just wanted to say that I tried your suggestion of:
Find what: <(tag1|tag2|tag3)>.*?</\1>
Replace with: [nothing]
... worked a treat - in fact, it worked even better than I had hoped (I expected I would have to declare each tag individually - I didn't realise that I could list them all and just run the job once - so a massive thank you for saving me hours of hard labour!!! :D )

I've now got another question :wink: which I hope you can help me with?

The XML feed I am working with comes from a company in Europe and as a result, there are quite a few spelling mistakes / errors in the English translation.

Ideally I'd like to avoid having to manually do a spelling check each time I process the feed as it contains well over 1.7 million lines!

As a result, I wondered if it was possible for me to specify a list of words to look out for, and what these should be replaced with?

- I can then run this using wildedit and hopefully knock a lot of the errors on the head really quickly!

So for example, how would I go about performing the following replacement:

fishng = fishing
hellllooo = hello
gooooogle = google

(The above should NOT be case sensitive)

Also, I might need to do a case sensitive find replace such as:

dvd = DVD
england = England

How would I go about doing this? (I'm happy to run that as a separate find / replace job)

Also, will I need to specify a space before / after each of the above? (I don't know / don't want wildedit to use stem searching)

Thanks
Kessa

Posted: Wed Mar 21, 2007 9:26 am
by ben_josephs
In WildEdit:
Find what: \<(?:(fishng)|(hellllooo)|(gooooogle))\>
Replace with: ?1(fishing):?2(hello):?3(google)

[X] Regular expression
[X] Replacement format
[ ] Match case
In WildEdit's help, search for the phrases
"start word" for the regular expressions \< and \>,
"non-marking parentheses" for the regular expression construct (?:...), and
"conditional expression" for the replacement expression construct ?n...:...

Surely the searches for dvd, england, etc., are case insensitive, too.

If you have a lot of these misspellings, you may want to consider doing this with a script instead.
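
As a rough idea of what such a script could look like, here is a Python sketch that keeps the word list in a dictionary. The names CORRECTIONS and fix_spelling are invented for this example, and the entries are just the ones mentioned in this thread. Python's \b is used in place of \< and \> to match whole words only, so no spaces need to be specified around each word.

```python
import re

# Keys must be lower case, because matches are lower-cased before the lookup.
CORRECTIONS = {
    "fishng": "fishing",
    "hellllooo": "hello",
    "gooooogle": "google",
    "dvd": "DVD",
    "england": "England",
}

def fix_spelling(text, corrections=CORRECTIONS):
    """Replace whole-word, case-insensitive matches with their corrections."""
    words = "|".join(re.escape(word) for word in corrections)
    # \b on both sides means 'fishng' matches but 'fishngs' does not.
    pattern = re.compile(r"\b(" + words + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: corrections[m.group(1).lower()], text)

print(fix_spelling("Gone FISHNG with a dvd in england; fishngs stays"))
```

One design point to be aware of: the replacement always uses the spelling stored in the dictionary, so a capitalised misspelling at the start of a sentence comes back in lower case.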

Posted: Wed Mar 21, 2007 4:46 pm
by kessa
fab - thanks :D

Posted: Tue May 22, 2007 11:14 pm
by kessa
Hi ben_josephs,

I wonder if you may be able to help me with something, just following on from the posts above?

I've tried using the code suggested, and for the most part it seems to work fine.

However, for a couple of the things I am trying to replace I am getting some weird results.

For example, if I do a search for "livingroom" using textpad, I get 3,709 results in my document.

However, if I do a find / replace in Wildedit, it seems to return 3,748 results? (an additional 39 results)

In Textpad, I'm just using a bog-standard find (i.e. not case sensitive and not using regular expressions)

In Wildedit, I'm using the code above, and am trying to replace "livingroom" with "living room"

Any ideas what could be causing the inconsistencies?

Thanks
Kessa

Posted: Wed May 23, 2007 8:22 am
by ben_josephs
Please provide examples of what is found in WildEdit but not in TextPad.

Posted: Wed May 23, 2007 10:42 am
by kessa
Hi,

How do I find this out? When I run the find/replace it just tells me how many updates were made - it doesn't seem to show me where it made the changes.

Am I looking in the wrong place / do I need to do something differently?

Cheers
Kessa

Posted: Sun May 27, 2007 12:00 pm
by ben_josephs
There are many applications that will show you the differences between two files, including... TextPad. Ask it to show you the differences between what is produced by TextPad and by WildEdit.
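
If a script is preferred over a compare tool, the same comparison can be sketched in Python with the standard difflib module. This is only an illustration: the function name changed_lines and the sample strings are made up, and in practice the two arguments would be the contents of the file as saved from TextPad and from WildEdit. difflib can be slow on very large inputs, so a dedicated compare tool may cope better with a file of this size.

```python
import difflib

def changed_lines(before, after):
    """Return the removed ('-') and added ('+') lines between two texts."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    changes = []
    for index, line in enumerate(diff):
        if index < 2:
            continue  # skip the '---' / '+++' file header lines
        if line.startswith(("+", "-")):
            changes.append(line)  # '@@ ... @@' position markers are skipped
    return changes

# Hypothetical stand-ins for the two versions of the file.
before = "a livingroom\nb\nc livingroom"
after = "a living room\nb\nc livingroom"
for line in changed_lines(before, after):
    print(line)
```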