Please help me with this specific regex

General questions about using TextPad

Moderators: AmigoJack, bbadmin, helios, Bob Hansen, MudGuard

Rosseiro
Posts: 14
Joined: Tue Mar 22, 2011 1:33 am

Please help me with this specific regex

Post by Rosseiro »

Hi!

I have the following issue:


o princípio da anterioridade nonagesimal. Lembrando que a letra c foi objeto da Emenda Constitucional nº 42 de 2003. 
<table><tr><td>
<p>Art. 150. Sem prejuízo de outras garantias asseguradas ao contribuinte, é vedado à União, aos Estados, ao Distrito Federal e aos Municípios:
<p>[...]
<p>III - cobrar tributos:
<p>[...]
<p>b) no mesmo exercício financeiro em que haja sido publicada a lei que os instituiu ou aumentou;
<p>c) antes de decorridos noventa dias da data em que haja sido publicada a lei que os instituiu ou aumentou, observado o disposto na alínea b; (Incluído pela Emenda Constitucional nº 42, de 19.12.2003)
</td></tr></table>
<p>Aprendemos que existem exceções. Imposto sobre a Importação, Imposto sobre a Exportação, e algumas contribuições que valem desde já. <span style="color: red;">Na prova, teremos que nominar!</span> O professor fatalmente irá pedir na prova. Qualquer prova que contenha Direito Tributário irá despencar isso.
<p>


Above is a piece of HTML code. Normally, every paragraph begins with a <p> tag, no matter how short it is.

But, as you can also see, there is a single-cell table, enclosed by "<table><tr><td>" and "</td></tr></table>".

Inside that table block, I'd like to remove all the <p> tags automatically.
This can't be done with a macro in TextPad because we never know the length of the table, and there may be several tables in a single document. I want the <p> tags outside the tables to stay, but not the ones inside.
Any help? :)
PeteTheBloke
Posts: 39
Joined: Fri Apr 22, 2005 8:15 am
Location: N. Ireland

Post by PeteTheBloke »

I think you might have to use a parser. I don't reckon regex is the tool for the job here. If you don't have closing </p> tags your HTML is not well-formed and a parser may not work correctly either (it won't be able to identify the nodes correctly).

Sorry about that. I hope I'm wrong.
ben_josephs
Posts: 2459
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

This is a job for a script. Unfortunately, TextPad doesn't support scripts.

However, you can do it in TextPad, although it's very tedious.

First, use POSIX regular expression syntax:
Configure | Preferences | Editor

[X] Use POSIX regular expression syntax

Choose two characters that do not occur in your document, say † and ‡.

Then try the following five steps:

1. Replace each occurrence of </table> with a ‡ (so you can search for its absence):
Find what: </table>
Replace with: ‡

[X] Regular expression

Replace All
2. Replace each newline within a <table> element with a † (so that each <table> element is wholly within one line):
Find what: (<table>[^‡]*)\n
Replace with: \1†

[X] Regular expression

Replace All -- do this repeatedly until it beeps
3. Remove each occurrence of <p> within a <table> element:
Find what: (<table>[^‡]*)<p>
Replace with: \1

[X] Regular expression

Replace All -- do this repeatedly until it beeps
4. Change each † back to a newline:
Find what: †
Replace with: \n

[X] Regular expression

Replace All
5. Change each ‡ back to </table>:
Find what: ‡
Replace with: </table>

[X] Regular expression

Replace All
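
For readers who want to see the placeholder idea outside TextPad, the five steps above can be sketched in Python. This is only an emulation under assumptions: the function names are made up for illustration, and Python's `re` module is not TextPad's regex engine (in Python, `[^‡]` also matches newlines, so step 2 is kept only to mirror the procedure).

```python
import re

END = "\u2021"  # ‡ stands in for </table>
NL = "\u2020"   # † stands in for a newline inside a table


def replace_until_stable(pattern: str, repl: str, text: str) -> str:
    """Repeat a 'Replace All' until nothing changes (TextPad: until it beeps)."""
    while True:
        new_text = re.sub(pattern, repl, text)
        if new_text == text:
            return text
        text = new_text


def strip_p_in_tables(text: str) -> str:
    """Hypothetical helper that mirrors the five manual steps."""
    if END in text or NL in text:
        raise ValueError("placeholder characters already occur in the text")
    # Step 1: mark the end of every table with a single character.
    text = text.replace("</table>", END)
    # Step 2: fold each table onto one line, one newline per table per pass.
    text = replace_until_stable(rf"(<table>[^{END}]*)\n", rf"\1{NL}", text)
    # Step 3: drop <p> tags that follow <table> before the end marker.
    text = replace_until_stable(rf"(<table>[^{END}]*)<p>", r"\1", text)
    # Step 4: restore the newlines.
    text = text.replace(NL, "\n")
    # Step 5: restore the closing tags.
    return text.replace(END, "</table>")
```

Each pass of steps 2 and 3 changes at most one match per table, which is why the replacement has to be repeated until nothing more is found.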
PeteTheBloke
Posts: 39
Joined: Fri Apr 22, 2005 8:15 am
Location: N. Ireland

Post by PeteTheBloke »

Good effort Ben! I think I'd have given up before getting to that solution.
Rosseiro
Posts: 14
Joined: Tue Mar 22, 2011 1:33 am

Post by Rosseiro »

Working on testing it! Indeed a great effort! I'll come back later to report what happened. Sorry for taking so long.

Edit: So, I tried it. I can't help thinking there is a typo in step 3 above: TextPad won't find that particular regex. I'm sure I ticked the POSIX option, and I did not type the regex myself but copied and pasted yours. What's wrong?

Thanks in advance Ben (and Pete too!)
ben_josephs
Posts: 2459
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Is there a spurious space at the end of your regex?
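
A trailing space is part of the pattern, so a pasted regex that picks one up will only match "<p>" followed by a space. A small Python illustration, assuming a sample line and a hypothetical helper name (TextPad's engine differs, but the effect of the stray space is the same):

```python
import re

# A table already folded onto one line, as after step 2.
line = "<table><tr><td>†<p>Art. 150.†</td></tr>‡"

pasted = "(<table>[^‡]*)<p> "  # note the stray space at the end


def finds_match(pattern: str) -> bool:
    """Report whether the pattern matches anywhere in the sample line."""
    return re.search(pattern, line) is not None


print(finds_match(pasted))           # pattern with the trailing space
print(finds_match(pasted.rstrip()))  # same pattern, space removed
```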
Rosseiro
Posts: 14
Joined: Tue Mar 22, 2011 1:33 am

Post by Rosseiro »

Oh my! That was it! Now it works!
What an ingenious solution! Now let me put it inside a macro... I'll come back here if I have problems.

Just curious: you said this is a job for a script. What kind of scripts are you talking about?

Thanks for everything, ben!! :D
ben_josephs
Posts: 2459
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

I mean scripts in your favourite scripting language, be it Perl, Python, Ruby, Tcl, ECMAScript ("JavaScript"), or what you will.
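
As an illustration only, here is a minimal Python sketch of such a script. It assumes tables are not nested and that every table has a closing </table>; the function name is hypothetical. For anything messier, a real HTML parser would be the safer choice.

```python
import re

# Non-greedy match from an opening <table ...> to the nearest </table>.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)
P_OPEN_RE = re.compile(r"<p>", re.IGNORECASE)


def remove_p_inside_tables(html: str) -> str:
    """Delete <p> tags inside each table block, leaving the rest untouched."""
    return TABLE_RE.sub(lambda m: P_OPEN_RE.sub("", m.group(0)), html)


if __name__ == "__main__":
    import sys

    # Usage: python script.py < input.html > output.html
    sys.stdout.write(remove_p_inside_tables(sys.stdin.read()))
```

The callback passed to `sub` sees one whole table at a time, so no placeholder characters or repeated passes are needed.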