Hi All,
I have 400+ XML files, and I need to discard those parts of them that do not contain a particular element embedded in another element that spans many lines.
More specifically, I need to keep <s> elements such as
<s>
<NOOJLU LEMMA="," CAT="WPUNCT">,</NOOJLU>
<NOOJLU LEMMA="mint" CAT="Con">mint</NOOJLU>
<NOOJLU LEMMA="téli" CAT="A" NOM>téli</NOOJLU>
<NOOJLU LEMMA="rokon" CAT="A" PSe3 NOM>rokona</NOOJLU>
<NOOJLU LEMMA="." CAT="SPUNCT">.</NOOJLU>
<IINF>
<NOOJLU LEMMA="iszik" CAT="V" INRt3>inniuk</NOOJLU>
</IINF>
</s>
because they contain the <IINF> element, but I need to discard all <s> elements that do not have the <IINF> element, for example:
<s>
<NOOJLU LEMMA="a" CAT="Det">A</NOOJLU>
<NOOJLU LEMMA="rendkívül" CAT="Adv">rendkívül</NOOJLU>
<NOOJLU LEMMA="könnyen" CAT="Adv">könnyen</NOOJLU>
<NOOJLU LEMMA="," CAT="WPUNCT">,</NOOJLU>
</s>
Is there a way to do this in WildEdit, or do I need to learn to write a script to do it? (I have searched the forum but haven't found the answer.)
Any help is appreciated,
Gergo
Multiple-line xml tag search/replace in WildEdit?
Note to new readers of this thread: This attempted solution doesn't work. See later posts below.
The subexpression (?!.*<IINF>.*) is a negative look-ahead assertion; it matches zero characters at a position in the text being searched only if .*<IINF>.* doesn't match at that position. In other words, it anchors the match to positions at which .*<IINF>.* doesn't match. That is, it matches only if what follows doesn't contain <IINF> .
So (?!.*<IINF>.*).*? matches (minimally) substrings that do not contain <IINF> .
BTW, your example is not valid XML. For example, the line
<NOOJLU LEMMA="rokon" CAT="A" PSe3 NOM>rokona</NOOJLU>
is illegal. Attributes have to have values. Even if the M in XML stands for Magyar.
The subexpression .*? is non-greedy. It matches minimally; that is, it matches the shortest possible substring that allows the whole expression to match. (Whereas .* is greedy. It matches maximally; that is, it matches the longest possible substring.)
Find what: <s>(?!.*<IINF>.*).*?</s>
Replace with: [nothing]
[X] Regular expression
[X] Replacement format
Options
[ ] '.' does not match a newline character [i.e., not selected]
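For readers who want to see the greedy/non-greedy difference and the effect of the newline option outside WildEdit, here is a minimal sketch in Python. It assumes Python's re module behaves like WildEdit's regex engine for these particular constructs, which I have not verified; the sample text is made up for illustration.

```python
import re

# Two tiny <s> elements, each spanning several lines (made-up sample text).
text = "<s>\nfirst\n</s>\n<s>\nsecond\n</s>\n"

# re.DOTALL lets '.' match newlines, which corresponds to leaving the option
# "'.' does not match a newline character" unselected.
greedy = re.findall(r"<s>.*</s>", text, flags=re.DOTALL)
lazy = re.findall(r"<s>.*?</s>", text, flags=re.DOTALL)

# Without DOTALL, '.' stops at the newline right after <s>, so nothing matches.
no_dotall = re.findall(r"<s>.*?</s>", text)

print(len(greedy))     # greedy .* runs from the first <s> to the last </s>
print(len(lazy))       # non-greedy .*? stops at the nearest </s>
print(len(no_dotall))
```

The greedy pattern swallows both elements in a single match, while the non-greedy one finds each element separately.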
Last edited by ben_josephs on Thu Dec 21, 2006 9:22 am, edited 1 time in total.
ben_josephs wrote:
Find what: <s>(?!.*<IINF>.*).*?</s>
Replace with: [nothing]
[X] Regular expression
[X] Replacement format
Options
[ ] '.' does not match a newline character [i.e., not selected]
Many thanks for the quick reply. Tried it, but it didn't seem to be working. This is the log I got:
=== BEGIN REPLACE COMMAND ===
{
Time: 2006-Dec-19 19:42:22
Search Pattern: <s>(?!.*<IINF>.*).*?</s>
Replacement Format:
Character Encoding: iso-8859-2
Root folder: J:\Disszertáció\Anyagok_és_módszerek\MNSz\Célkorpusz\Próba
File Filter: *.txt
Regular Expression: true
Replacement Format: true
Match Case: false
Match Words: false
Search Subfolders: false
}
J:/Disszertáció/Anyagok_és_módszerek/MNSz/Célkorpusz/Próba/próba1.txt: 0 replacements made
J:/Disszertáció/Anyagok_és_módszerek/MNSz/Célkorpusz/Próba/próba2.txt: 0 replacements made
Number of files searched: 2
Number of files modified: 0
Total changes made: 0
=== END REPLACE COMMAND ===
Any idea what went wrong?
ben_josephs wrote:
BTW, your example is not valid XML. For example, the line
<NOOJLU LEMMA="rokon" CAT="A" PSe3 NOM>rokona</NOOJLU>
is illegal. Attributes have to have values. Even if the M in XML stands for Magyar.
Yes, I know my example was not from a valid XML file. For some mysterious reason this quasi-XML is what the linguistic development tool I'm experimenting with requires as input format. (BTW it's NooJ (http://www.nooj4nlp.net) and free to use.)
ben_josephs wrote:
It works here with the sample text you gave earlier.
Has your search pattern got a space at the end of it? It shouldn't have.
There is no space at the end of my search pattern.
ben_josephs wrote:
Did you deselect the option '.' does not match a newline character ?
Yes, I deselected this option. Still, it doesn't work. (Details in a private message.)
Oh dear, bottyang is right.
The trouble is that .*<IINF>.* matches whenever <IINF> occurs anywhere past the current position, even if there's a </s> before it. So (?!.*<IINF>.*) causes the match to fail at all positions before any <IINF> . So, for example, if there's an <IINF> element in the last <s> element, the whole expression always fails to match. I was using a very restrictive example when I tested my regex. Whoops!
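To make the failure concrete, here is a small sketch in Python that reproduces it. It assumes Python's re module treats look-ahead the same way WildEdit does; the two sample elements are shortened, made-up versions of the ones in the first post.

```python
import re

# The first attempt; re.DOTALL stands in for leaving the
# "'.' does not match a newline character" option unselected.
first_try = re.compile(r"<s>(?!.*<IINF>.*).*?</s>", re.DOTALL)

# Shortened, illustrative versions of the elements from the original question.
without_iinf = '<s>\n<NOOJLU LEMMA="a" CAT="Det">A</NOOJLU>\n</s>\n'
with_iinf = '<s>\n<IINF>\n<NOOJLU LEMMA="iszik" CAT="V">inniuk</NOOJLU>\n</IINF>\n</s>\n'

# On its own, the element without <IINF> is found, as intended.
alone = first_try.findall(without_iinf)

# Followed by an element that contains <IINF>, it is no longer found:
# the look-ahead sees <IINF> further on, beyond the first </s>, and rejects the match.
followed = first_try.findall(without_iinf + with_iinf)

print(len(alone), len(followed))
```

This is the situation in the replace log above: any <IINF> later in the file is enough to block every earlier <s> element.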
Here's another approach. This regex matches a <s> element that contains a sequence of items, where each item is either the beginning of a tag that isn't <IINF> or </IINF>, or is a single character that isn't < :
<s>(</?(?!IINF)|[^<])*?</s>\s*
Thanks to bottyang for getting me to think about this properly.
Later: WildEdit seems to be having problems with certain input text. If you try it in test mode on such problem text, the program crashes. If you run it on a file containing the problem text, it doesn't crash, but it leaves a temporary file containing part of the new version of the file; the original is left intact.
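If WildEdit keeps choking on your files, the same idea can be tried from a short script. The following is only a sketch in Python, under a few assumptions of mine: the function names keep_iinf_sentences and filter_file are hypothetical, the encoding default is taken from the replace log earlier in the thread, and I have dropped the optional / from the regex. That last change should not alter what is matched, because a < followed by / is already accepted as "< not followed by IINF" and the / is then accepted as an ordinary character; having only one way to consume each character may also reduce backtracking, though I have not measured that.

```python
import re
from pathlib import Path

# Variant of the second regex: an <s> element whose content is a sequence of
# items, each either a '<' not followed by IINF or a single non-'<' character.
# Trailing whitespace after </s> is removed along with the element.
S_WITHOUT_IINF = re.compile(r"<s>(?:<(?!IINF)|[^<])*?</s>\s*")


def keep_iinf_sentences(text: str) -> str:
    """Delete every <s> element that contains no <IINF> opening tag."""
    return S_WITHOUT_IINF.sub("", text)


def filter_file(src: Path, dst: Path, encoding: str = "iso-8859-2") -> None:
    """Write a filtered copy of src to dst; the original file is left untouched."""
    dst.write_text(keep_iinf_sentences(src.read_text(encoding=encoding)),
                   encoding=encoding)


# Shortened, illustrative versions of the elements from the original question.
without_iinf = '<s>\n<NOOJLU LEMMA="a" CAT="Det">A</NOOJLU>\n</s>\n'
with_iinf = '<s>\n<IINF>\n<NOOJLU LEMMA="iszik" CAT="V">inniuk</NOOJLU>\n</IINF>\n</s>\n'

result = keep_iinf_sentences(without_iinf + with_iinf)
print(result == with_iinf)
```

Writing to a separate destination file is deliberate: it lets you compare the output with the original before replacing anything.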