
Please help me with this specific regex

Posted: Tue Apr 17, 2012 2:47 am
by Rosseiro
Hi!

I have the following issue:

Code:

o princípio da anterioridade nonagesimal. Lembrando que a letra c foi objeto da Emenda Constitucional nº 42 de 2003. 
<table><tr><td>
<p>Art. 150. Sem prejuízo de outras garantias asseguradas ao contribuinte, é vedado à União, aos Estados, ao Distrito Federal e aos Municípios:
<p>[...]
<p>III - cobrar tributos:
<p>[...]
<p>b) no mesmo exercício financeiro em que haja sido publicada a lei que os instituiu ou aumentou;
<p>c) antes de decorridos noventa dias da data em que haja sido publicada a lei que os instituiu ou aumentou, observado o disposto na alínea b; (Incluído pela Emenda Constitucional nº 42, de 19.12.2003)
</td></tr></table>
<p>Aprendemos que existem exceções. Imposto sobre a Importação, Imposto sobre a Exportação, e algumas contribuições que valem desde já. <span style="color: red;">Na prova, teremos que nominar!</span> O professor fatalmente irá pedir na prova. Qualquer prova que contenha Direito Tributário irá despencar isso.
<p>


Above is a piece of HTML code. Normally, every paragraph begins with a <p> tag, no matter how short it is.

But, as you can also see, there is a single-cell table in it, enclosed by "<table><tr><td>" and "</td></tr></table>".

Inside that table block, I'd like to remove all the <p> tags automatically.
This can't be done with a macro in TextPad because we never know the length of the table, and there may be several tables in a single document. I want the <p> tags to stay outside the tables, but not inside them.
Any help? :)

Posted: Fri Apr 20, 2012 12:00 pm
by PeteTheBloke
I think you might have to use a parser. I don't reckon regex is the tool for the job here. If you don't have closing </p> tags your HTML is not well-formed and a parser may not work correctly either (it won't be able to identify the nodes correctly).

Sorry about that. I hope I'm wrong.
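
As a counterpoint, an event-based tokenizer does not need well-formed input, because it only reports tags as it meets them and never builds a tree. Below is a minimal sketch, assuming Python and its standard html.parser module (neither is part of the original question). The class and function names are made up for illustration. It re-emits everything it reads except <p> start tags seen while inside a <table>. One caveat: end tags are written back in lowercase.

```python
from html.parser import HTMLParser


class TableParagraphStripper(HTMLParser):
    """Re-emit the input, dropping <p> start tags found inside <table>."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0   # current <table> nesting depth
        self.out = []    # pieces of the rebuilt document

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        if tag == "p" and self.depth > 0:
            return       # drop this <p>
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_p_in_tables(html):
    parser = TableParagraphStripper()
    parser.feed(html)
    parser.close()       # flush any buffered trailing text
    return "".join(parser.out)
```

Because the depth counter goes up and down with each <table> and </table>, nested tables are handled too, and the missing </p> tags do not matter.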

Posted: Fri Apr 20, 2012 2:01 pm
by ben_josephs
This is a job for a script. Unfortunately, TextPad doesn't support scripts.

However, you can do it in TextPad, although it's very tedious.

First, use "Posix" regular expression syntax:
Configure | Preferences | Editor

[X] Use POSIX regular expression syntax

Choose two characters that do not occur in your document, say † and ‡.

Then try the following five steps:

1. Replace each occurrence of </table> with a ‡ (so you can search for its absence):
Find what: </table>
Replace with: ‡

[X] Regular expression

Replace All

2. Replace each newline within a <table> element with a † (so that each <table> element is wholly within one line):
Find what: (<table>[^‡]*)\n
Replace with: \1†

[X] Regular expression

Replace All -- do this repeatedly until it beeps

3. Remove each occurrence of <p> within a <table> element:
Find what: (<table>[^‡]*)<p>
Replace with: \1

[X] Regular expression

Replace All -- do this repeatedly until it beeps

4. Change each † back to a newline:
Find what: †
Replace with: \n

[X] Regular expression

Replace All

5. Change each ‡ back to </table>:
Find what: ‡
Replace with: </table>

[X] Regular expression

Replace All
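
For anyone who wants to check the logic outside TextPad, the five steps can be mirrored roughly as follows. This is only a sketch in Python, which is an assumption and not something TextPad runs, and the helper names are invented. The character class excludes \n explicitly so that a match stays within one line, which is how the steps above rely on TextPad searching, and the loop stands in for pressing Replace All until it beeps.

```python
import re


def replace_until_stable(pattern, repl, text):
    """Mimic pressing Replace All repeatedly until nothing matches."""
    while True:
        text, count = re.subn(pattern, repl, text)
        if count == 0:
            return text


def strip_p_five_steps(text):
    # The two placeholder characters must not already occur in the document.
    if "†" in text or "‡" in text:
        raise ValueError("placeholder characters already occur in the text")
    text = text.replace("</table>", "‡")                               # step 1
    text = replace_until_stable(r"(<table>[^‡\n]*)\n", r"\1†", text)   # step 2
    text = replace_until_stable(r"(<table>[^‡\n]*)<p>", r"\1", text)   # step 3
    text = text.replace("†", "\n")                                     # step 4
    return text.replace("‡", "</table>")                               # step 5
```

Each pass of step 3 removes only the last remaining <p> in a table, because the greedy [^‡\n]* runs as far as it can before backing off to a <p>. That is why the replacement has to be repeated.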

Posted: Fri Apr 20, 2012 7:46 pm
by PeteTheBloke
Good effort, Ben! I think I'd have given up before getting to that solution.

Posted: Thu Apr 26, 2012 1:34 am
by Rosseiro
I'm about to test it! A great effort indeed! I'll come back later to report what happened. Sorry for taking so long.

Posted: Sun May 06, 2012 7:49 am
by Rosseiro
Edit: So, I tried it. I may be wrong, but I think there is a typo in step 3 above: TextPad won't find that particular regex. I'm sure I ticked the POSIX option, and I did not type the regex myself but copied and pasted yours. What's wrong?

Thanks in advance, Ben (and Pete too!)

Posted: Sun May 06, 2012 8:13 am
by ben_josephs
Is there a spurious space at the end of your regex?
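
A trailing space is easy to pick up when copying from a web page, and it is a literal character in a regex. Here is a small illustration in Python (an assumption, it is not TextPad's engine, though the effect is the same), using a made-up one-line sample of the kind step 2 produces:

```python
import re

# A table already squeezed onto one line, as it looks after step 2.
line = "<table><tr><td>†<p>III - cobrar tributos:†</td></tr>‡"
pattern = r"(<table>[^‡]*)<p>"

clean_hit = bool(re.search(pattern, line))          # True
padded_hit = bool(re.search(pattern + " ", line))   # False: needs a space after <p>

print(clean_hit, padded_hit)
```

With the extra space, the regex only matches a <p> that is itself followed by a space, so a search over this kind of text finds nothing.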

Posted: Sun May 06, 2012 9:06 am
by Rosseiro
Oh my! That was it! Now it works!
What a genius solution! Now let me put it inside a macro... I'll come back here if I have problems.

Just curious: you said this is a job for a script. What kind of scripts are you talking about?

Thanks for everything, Ben!! :D

Posted: Sun May 06, 2012 10:52 am
by ben_josephs
I mean scripts in your favourite scripting language, be it Perl, Python, Ruby, Tcl, ECMAScript ("JavaScript"), or what you will.
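
To give an idea of how short such a script can be, here is a minimal sketch in Python. The function name is invented for illustration. It assumes tables are not nested, since the non-greedy match stops at the first </table>, and it also accepts tags with attributes, which the TextPad steps do not.

```python
import re

# One whole table, from its opening tag to the nearest closing tag.
TABLE_BLOCK = re.compile(r"<table\b[^>]*>.*?</table>", re.DOTALL | re.IGNORECASE)
# An opening <p> tag, with or without attributes (but not <pre>, <param>, ...).
P_OPEN = re.compile(r"<p\b[^>]*>", re.IGNORECASE)


def strip_p_with_regex(html):
    """Remove opening <p> tags inside each table and leave the rest alone."""
    return TABLE_BLOCK.sub(lambda m: P_OPEN.sub("", m.group(0)), html)
```

The outer substitution hands each table block to a function, and that function applies the inner substitution to that block only. This is the part a single search-and-replace cannot express. To use it on a file, read the file into a string, call the function, and write the result back.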