How to remove extra spaces, gaps, line breaks, etc.

General questions about using TextPad

Ben

How to remove extra spaces, gaps, line breaks, etc.

Post by Ben »

Randall's and Andreas' regexps work beautifully on correctly developed HTML. Unfortunately, they stumble on the poorly coded HTML that someone else developed in FrontPage. There are extra line breaks and spaces everywhere, which reduces the number of matches found. I know how to remove every single line break and gap in a file, but then it becomes hard to read without a few logically placed line breaks and tab indents. I'm looking for a more precise cure that limits the collateral damage to the surrounding code. Can anyone help me with regexps that get rid of the following:

1) Extra spaces that occur in the middle of tags or text strings, such as:

<a href="homepage.html">Contact List</a>
<a href="homepage.html">Contact List</a>


2) Extra line breaks that occur BETWEEN different tags (but I don't want to remove every line break in the page, and I don't want to disturb the existing tab structure if possible):

<a href="homepage.html">
<font face="arial">Contact List</font></a>


3) Extra line breaks that occur WITHIN tags:

<a
href="homepage.html">Contact List</a>


4) If different from above, I want to remove any extra line breaks that occur WITHIN text too:

<a href="homepage.html">Contact
List</a>

I have tested some regexps of my own (using tagged expressions and combinations of [:space:], [:blank:], and \n), but they are too sensitive and change more code than I want. I realize some of these things may require multiple passes. Thanks in advance for any help.

Ben
Ben

Correction to example number 1

Post by Ben »

Substitute *space* for "_"

1) Extra spaces that occur in the middle of tags or text strings, such as:

<a_____________href="homepage.html">Contact List</a>
<a href="homepage.html">Contact_____________List</a>

Ben
Ben

Can someone enhance this?

Post by Ben »

Silly to reply to myself, but the following FIND expression seems to get rid of the extra spaces:

[ \t][ \t]+

However, it tends to remove any pseudo-spaces used for tab indents. Any other ways to accomplish this?

Any suggestions for the other items above?

Ben
Andreas

Re: Can someone enhance this?

Post by Andreas »

Use
\([^ \t]\)[ \t][ \t]+
and replace by
\1
i.e. \1 followed by a space.
This will find one non-space/tab character followed by more than one space or tab
and replace it with the found character plus a space.
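
For anyone scripting this outside TextPad, the same find/replace can be sketched in Python. This is only an illustration under assumptions that are not from the thread: the name collapse_inner_spaces is made up, Python's re module writes groups as ( ) rather than \( \), and \r and \n are added to the excluded characters because a Python negated class would otherwise match the line break in front of an indent and flatten that indent.

```python
import re

# One character that is not a space, tab, or line break, followed by
# two or more spaces/tabs. Excluding \r and \n is an assumption made for
# Python so that "line break + indentation" is never treated as a match.
EXTRA_GAP = re.compile(r"([^ \t\r\n])[ \t]{2,}")


def collapse_inner_spaces(text: str) -> str:
    """Collapse runs of spaces/tabs inside a line to a single space.

    Leading indentation is left alone because every match must start
    with a non-whitespace character on the same line.
    """
    return EXTRA_GAP.sub(r"\1 ", text)


if __name__ == "__main__":
    sample = '\t\t<a     href="homepage.html">Contact     List</a>\n'
    print(collapse_inner_spaces(sample))
```

The tab indent at the start of the sample line survives, while the gaps inside the tag and inside the link text shrink to one space each.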
Randall McDougall

Re: Can someone enhance this?

Post by Randall McDougall »

To remove hard breaks embedded in tags (#3 above) use:

Regexp: \(<[^>]*\)\n
Replace: \1_

"_" = a space. If the tag is broken across several lines (i.e. one parameter to a line or some such) you'll have to repeat, so "repeat until not found" works best.

#2 & #4 are harder, for two reasons: (a) you want to match only certain tags [doing <body> or would not be what you want, so there's the question of *which*; so far you're primarily interested in anchor tags, it seems] and (b) the matching criterion is a negative condition, which is not "intuitive" with regexps and usually needs some transform/change/de-transform to work well. I'll give it some thought and reply in a bit.
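
The #3 regexp above, together with the "repeat until not found" advice, could be sketched in Python roughly as follows. Treat it as an illustration rather than the method used in the thread: join_breaks_in_tags is a made-up name, and the loop stands in for pressing Replace All until nothing more is found.

```python
import re

# An opening "<", any run of characters that are not ">", then a line break:
# in other words, a line break that falls before the tag has been closed.
OPEN_TAG_BREAK = re.compile(r"(<[^>]*)\n")


def join_breaks_in_tags(text: str) -> str:
    """Replace line breaks inside tags with a space, repeating until none are left."""
    while True:
        # Each pass fixes at most one break per tag, so keep going
        # until a pass makes no replacement at all.
        text, count = OPEN_TAG_BREAK.subn(r"\1 ", text)
        if count == 0:
            return text


if __name__ == "__main__":
    sample = '<a\nhref="homepage.html"\ntitle="x">Contact\nList</a>\n'
    print(join_breaks_in_tags(sample))
```

Only the breaks inside the opening tag are joined; the break in the link text ("Contact" / "List") is left for the #2 and #4 cleanup.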
Ben

OK, you guys nailed 1 and 3 . . .

Post by Ben »

...and the regexps worked great so far. I see how 2 and 4 may be trickier. Let's see if I can help limit the constraints further for each.

2) Extra line breaks that occur BETWEEN different tags DIRECTLY FOLLOWING or WITHIN a set of anchor tags. Worst case scenario might be something like:

<a href="homepage.html">
<font face="arial">
<b>Contact List</b></font>
</a>

(where the occurrence of breaks is not necessarily consistent for a set of opening and closing tags).


4) Extra line breaks that occur WITHIN anchor tag link text ONLY:

<a href="homepage.html">Contact
List</a>

Ben
Ben

Just thought of one more roadblock

Post by Ben »

The regexps you guys developed for cleaning up the extra tags in title attributes:

FIND: \(title="Link to [^<"]*\)<[^>]*>\([^"]*\)
REPLACE: \1\2

...won't find any matches that happen to include any non-breaking spaces (&nbsp;). This seems weird because the regexp that finds and generates the title attribute doesn't care about any non-breaking spaces, only the regexp I use afterward to clean it up.

Example:

<a href="homepage.html" title="  Contact List">  Contact List
</a>


Any idea what is tripping up TextPad in this instance? Thanks again for all the great help! I only know enough about regexps to get myself in trouble, so I must humbly yield to the masters. :)

Ben
Ben

Correction...

Post by Ben »

Substitute "nbsp" for the "XXXX" characters in the example below:

<a href="homepage.html" title="&XXXX;&XXXX;Contact List">&XXXX;&XXXX;Contact List
</a>

Ben
Randall McDougall

Re: Correction...

Post by Randall McDougall »

Sorry it took so long to get back to you on this, life's been ... interesting.

For the last question first: an "&nbsp;" is not an actual space until a browser interprets it for display; to a regexp it's just a string of characters. As part of your title attribute they're irrelevant, so you should convert them to real spaces:

regexp: \(<a [^>]*title="[^"]*\)\(&_nbsp_;\)+
replace: \1_

where: &_nbsp_; has the "_" to keep the forum from actually converting to a space character, and "_" in the replace is a trailing space.
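
As a rough Python sketch of that conversion (not something from the thread: the function name nbsp_to_spaces_in_titles is hypothetical, and the repeat loop is an added assumption so that several separate runs of &nbsp; in one title are all caught):

```python
import re

# Inside an <a ...> tag: everything up to and including part of the title
# value is captured, followed by one or more literal "&nbsp;" entities.
NBSP_IN_TITLE = re.compile(r'(<a [^>]*title="[^"]*)(?:&nbsp;)+')


def nbsp_to_spaces_in_titles(text: str) -> str:
    """Turn &nbsp; entities inside anchor title attributes into real spaces."""
    while True:
        # A single pass only converts one run of entities per title,
        # so repeat until a pass changes nothing.
        text, count = NBSP_IN_TITLE.subn(r"\1 ", text)
        if count == 0:
            return text


if __name__ == "__main__":
    sample = ('<a href="homepage.html" title="&nbsp;&nbsp;Contact List">'
              '&nbsp;&nbsp;Contact List\n</a>')
    print(nbsp_to_spaces_in_titles(sample))
```

The entities in the link text are left untouched because the pattern can never run past the closing quote of the title value.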

I think we can deal with 2 & 4 as being basically the same problem: remove all line breaks between any anchor [a] tags and corresponding end anchor [/a] tags ... to do that first you need to "simplify" the end-anchors temporarily:

regexp: </a>
replace: \x09

now, repeat the following until it fails to match anything:

regexp: \(<a [^\x09]*\)\n\(.\)
replace: \1 \2

and you're almost done, *unless* there are anchor tags embedded in other anchor tags [a bad idea, but not impossible or without some uses] that contain line breaks after the inside tag closes. *That* could be nearly impossible to fix automatically. Search for:

find: <a [^\x09]*<a

to see if you've got any of those, and fix them yourself. There certainly shouldn't be many; if there are, it's possible to convert them to a side-by-side layout, but the regexp for that is messier than it's worth if you don't have to.

now you'll have to reverse the simplification:

regexp: \x09
replace: </a>

And that's it.
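
The three steps above (simplify the end-anchors, join until nothing matches, restore) might look like this as a Python sketch. Several things here are assumptions rather than part of the thread: the name join_breaks_in_anchors is made up, and the stand-in character is \x01 instead of \x09, because \x09 is a tab and a file indented with real tabs would have every indent turned into </a> by the final step.

```python
import re

# Assumption: a control character that never occurs in the HTML being cleaned.
PLACEHOLDER = "\x01"

# "<a ", any run of characters that is not the placeholder, a line break,
# then one more character on the following line.
BREAK_IN_ANCHOR = re.compile(r"(<a [^\x01]*)\n(.)")

# A second "<a" before the first anchor has been closed (nested anchors).
NESTED_ANCHOR = re.compile(r"<a [^\x01]*<a")


def join_breaks_in_anchors(html: str):
    """Return (cleaned_html, nested_found) for line breaks between <a ...> and </a>."""
    if PLACEHOLDER in html:
        raise ValueError("placeholder character already occurs in the input")

    # Step 1: temporarily shrink every end-anchor to a single character.
    text = html.replace("</a>", PLACEHOLDER)

    # Step 2: repeat the join until it fails to match anything.
    while True:
        text, count = BREAK_IN_ANCHOR.subn(r"\1 \2", text)
        if count == 0:
            break

    # Nested anchors are only reported here; fixing them is left to a human.
    nested_found = NESTED_ANCHOR.search(text) is not None

    # Step 3: reverse the simplification.
    return text.replace(PLACEHOLDER, "</a>"), nested_found


if __name__ == "__main__":
    sample = ('<p>\n<a href="homepage.html">\n<font face="arial">\n'
              '<b>Contact List</b></font>\n</a>\n</p>\n')
    cleaned, nested = join_breaks_in_anchors(sample)
    print(cleaned)
    print("nested anchors found:", nested)
```

Line breaks outside the anchor (around the <p> tags in the sample) are kept, since the join pattern cannot extend past the stand-in for </a>.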
Ben

Thanks everyone!

Post by Ben »

I will be exercising TextPad's FIND/REPLACE capabilities a lot this weekend, and I'll post again if I'm successful or have more questions! You guys are life-savers!

Ben