Multiple-line xml tag search/replace in WildEdit?

General questions about using WildEdit


bottyang
Posts: 3
Joined: Mon Dec 18, 2006 10:40 pm

Multiple-line xml tag search/replace in WildEdit?

Post by bottyang »

Hi All,

I have 400+ XML files, and I need to discard every multi-line element in them that does not contain a particular embedded element.

More specifically, I need to keep <s> elements such as

<s>
<NOOJLU LEMMA="," CAT="WPUNCT">,</NOOJLU>
<NOOJLU LEMMA="mint" CAT="Con">mint</NOOJLU>
<NOOJLU LEMMA="téli" CAT="A" NOM>téli</NOOJLU>
<NOOJLU LEMMA="rokon" CAT="A" PSe3 NOM>rokona</NOOJLU>
<NOOJLU LEMMA="." CAT="SPUNCT">.</NOOJLU>
<IINF>
<NOOJLU LEMMA="iszik" CAT="V" INRt3>inniuk</NOOJLU>
</IINF>
</s>

because they contain the <IINF> element, but I need to discard all <s> elements that do not have the <IINF> element, for example:

<s>
<NOOJLU LEMMA="a" CAT="Det">A</NOOJLU>
<NOOJLU LEMMA="rendkívül" CAT="Adv">rendkívül</NOOJLU>
<NOOJLU LEMMA="könnyen" CAT="Adv">könnyen</NOOJLU>
<NOOJLU LEMMA="," CAT="WPUNCT">,</NOOJLU>
</s>

Is there a way to do this in WildEdit, or do I need to learn to write a script for it? (I have searched the forum but haven't found an answer.)

Any help is appreciated,
Gergo
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Note to new readers of this thread: This attempted solution doesn't work. See later posts below.
Find what: <s>(?!.*<IINF>.*).*?</s>
Replace with: [nothing]

[X] Regular expression
[X] Replacement format

Options
[ ] '.' does not match a newline character [i.e., not selected]
The subexpression .*? is non-greedy. It matches minimally; that is, it matches the shortest possible substring that allows the whole expression to match. (Whereas .* is greedy. It matches maximally; that is, it matches the longest possible substring.)
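
To make the greedy/non-greedy distinction concrete, here is a small sketch in Python. It assumes Python's re module treats .* and .*? the same way WildEdit's regex engine does; the sample string is made up for illustration.

```python
import re

# A made-up, single-line sample with two <s> elements.
sample = "<s>one</s><s>two</s>"

# Greedy: .* grabs as much as possible, so the match runs to the last </s>.
greedy = re.search(r"<s>.*</s>", sample).group(0)

# Non-greedy: .*? grabs as little as possible, so the match stops at the first </s>.
lazy = re.search(r"<s>.*?</s>", sample).group(0)

print(greedy)  # <s>one</s><s>two</s>
print(lazy)    # <s>one</s>
```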

The subexpression (?!.*<IINF>.*) is a negative look-ahead assertion; it matches zero characters at a position in the text being searched only if .*<IINF>.* doesn't match at that position. In other words, it anchors the match to positions at which .*<IINF>.* doesn't match. That is, it matches only if what follows doesn't contain <IINF>.

So (?!.*<IINF>.*).*? matches (minimally) substrings that do not contain <IINF>.

BTW, your example is not valid XML. For example, the line
<NOOJLU LEMMA="rokon" CAT="A" PSe3 NOM>rokona</NOOJLU>
is illegal. Attributes have to have values. Even if the M in XML stands for Magyar. :-)
Last edited by ben_josephs on Thu Dec 21, 2006 9:22 am, edited 1 time in total.
bottyang
Posts: 3
Joined: Mon Dec 18, 2006 10:40 pm

Post by bottyang »

ben_josephs wrote:
Find what: <s>(?!.*<IINF>.*).*?</s>
Replace with: [nothing]

[X] Regular expression
[X] Replacement format

Options
[ ] '.' does not match a newline character [i.e., not selected]
Many thanks for the quick reply. Tried it, but it didn't seem to be working. This is the log I got:

=== BEGIN REPLACE COMMAND ===
{
Time: 2006-Dec-19 19:42:22
Search Pattern: <s>(?!.*<IINF>.*).*?</s>
Replacement Format:
Character Encoding: iso-8859-2
Root folder: J:\Disszertáció\Anyagok_és_módszerek\MNSz\Célkorpusz\Próba
File Filter: *.txt
Regular Expression: true
Replacement Format: true
Match Case: false
Match Words: false
Search Subfolders: false
}
J:/Disszertáció/Anyagok_és_módszerek/MNSz/Célkorpusz/Próba/próba1.txt: 0 replacements made
J:/Disszertáció/Anyagok_és_módszerek/MNSz/Célkorpusz/Próba/próba2.txt: 0 replacements made
Number of files searched: 2
Number of files modified: 0
Total changes made: 0
=== END REPLACE COMMAND ===

Any idea what went wrong?
ben_josephs wrote:
BTW, your example is not valid XML. For example, the line
<NOOJLU LEMMA="rokon" CAT="A" PSe3 NOM>rokona</NOOJLU>
is illegal. Attributes have to have values. Even if the M in XML stands for Magyar. :-)
Yes, I know my example was not from a valid XML file. For some mysterious reason, this quasi-XML is the input format required by the linguistic development tool I'm experimenting with. (BTW, it's NooJ (http://www.nooj4nlp.net), and it's free to use.)
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

It works here with the sample text you gave earlier.

Has your search pattern got a space at the end of it? It shouldn't have.

Did you deselect the option "'.' does not match a newline character"?
bottyang
Posts: 3
Joined: Mon Dec 18, 2006 10:40 pm

Post by bottyang »

ben_josephs wrote:
It works here with the sample text you gave earlier.

Has your search pattern got a space at the end of it? It shouldn't have.
There is no space at the end of my search pattern.
ben_josephs wrote:
Did you deselect the option "'.' does not match a newline character"?
Yes, I deselected this option. Still, it doesn't work. :? (Details in a private message.)
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Oh dear, bottyang is right.

The trouble is that .*<IINF>.* matches whenever <IINF> occurs anywhere past the current position, even if there's a </s> before it. So (?!.*<IINF>.*) causes the match to fail at all positions before any <IINF>. So, for example, if there's an <IINF> element in the last <s> element, the whole expression always fails to match. I was using a very restrictive example when I tested my regex. Whoops!
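
The failure can be reproduced outside WildEdit with a short Python sketch. This assumes Python's re module handles look-ahead the same way WildEdit does, with re.DOTALL standing in for leaving "'.' does not match a newline character" unselected. The two sample elements and the helper count_matches are made up for illustration.

```python
import re

# The first attempt's pattern, with '.' allowed to match newlines.
FIRST_ATTEMPT = re.compile(r"<s>(?!.*<IINF>.*).*?</s>", re.DOTALL)

# Two shortened, made-up <s> elements modelled on the samples in this thread.
WITHOUT_IINF = '<s>\n<NOOJLU LEMMA="a" CAT="Det">A</NOOJLU>\n</s>\n'
WITH_IINF = (
    '<s>\n<IINF>\n'
    '<NOOJLU LEMMA="iszik" CAT="V" INRt3>inniuk</NOOJLU>\n'
    '</IINF>\n</s>\n'
)


def count_matches(text: str) -> int:
    """Return how many <s> elements the first-attempt pattern would delete."""
    return len(FIRST_ATTEMPT.findall(text))


# <IINF> occurs in the last element, so the look-ahead rejects every <s>.
print(count_matches(WITHOUT_IINF + WITH_IINF))  # 0

# Only an element located after the final <IINF> can still match.
print(count_matches(WITH_IINF + WITHOUT_IINF))  # 1
```

The first result mirrors the "0 replacements made" in the log above: an <s> without <IINF> is left alone as soon as any later element contains <IINF>.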

Here's another approach. This regex matches a <s> element that contains a sequence of items, where each item is either the beginning of a tag that isn't <IINF> or </IINF>, or a single character that isn't <:
<s>(</?(?!IINF)|[^<])*?</s>\s*

Thanks to bottyang for getting me to think about this properly.
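
For anyone who would rather script this than run it in WildEdit, the second regex can be tried in Python. This is only a sketch: it assumes Python's re module behaves like WildEdit's engine for this pattern, it swaps the capturing group for a non-capturing (?:...) group (which does not change what is matched), and the function name drop_sentences_without_iinf and the sample elements are made up. No re.DOTALL flag is needed because the pattern contains no '.'.

```python
import re

# Second approach: after <s>, allow only tag starts that are not followed by
# IINF, or single characters other than '<', up to the nearest </s>.
SECOND_ATTEMPT = re.compile(r"<s>(?:</?(?!IINF)|[^<])*?</s>\s*")

# Two shortened, made-up <s> elements modelled on the samples in this thread.
WITHOUT_IINF = '<s>\n<NOOJLU LEMMA="a" CAT="Det">A</NOOJLU>\n</s>\n'
WITH_IINF = (
    '<s>\n<IINF>\n'
    '<NOOJLU LEMMA="iszik" CAT="V" INRt3>inniuk</NOOJLU>\n'
    '</IINF>\n</s>\n'
)


def drop_sentences_without_iinf(text: str) -> str:
    """Delete every <s> element that contains no <IINF> tag."""
    return SECOND_ATTEMPT.sub("", text)


sample = WITH_IINF + WITHOUT_IINF + WITH_IINF
cleaned = drop_sentences_without_iinf(sample)

# Both <IINF>-bearing elements survive; the one in the middle is removed.
print(cleaned == WITH_IINF + WITH_IINF)  # True
```

To apply this to real files, the text would need to be read and written with the right encoding (the log above shows iso-8859-2), and it is worth testing on copies first.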

Later: WildEdit seems to be having problems with certain input text. If you try it in test mode on such problem text, the program crashes. If you run it on a file containing the problem text, it doesn't crash, but it leaves a temporary file containing part of the new version of the file; the original is left intact.