How to remove extra spaces, gaps, line breaks, etc.
Randall and Andreas' regexp's work beautifully on correctly developed HTML. Unfortunately they are stumbling on the poorly coded HTML that someone else developed in FrontPage. There are extra line breaks and spaces everywhere, prematurely reducing the number of matches found. I know how to remove every single line break and gap in a file, but then it becomes hard to read without a few logically placed line breaks and tab indents. I'm looking for a more accurate cure that limits the collateral damage I cause to the surrounding code. Can anyone help me with regexp's that get rid of the following:
1) Extra spaces that occur in the middle of tags or text strings, such as:
<a href="homepage.html">Contact List</a>
<a href="homepage.html">Contact List</a>
2) Extra line breaks that occur BETWEEN different tags (but I don't want to remove every line break in the page, and I don't want to disturb the existing tab structure if possible):
<a href="homepage.html">
<font face="arial">Contact List</font></a>
3) Extra line breaks that occur WITHIN tags:
<a
href="homepage.html">Contact List</a>
4) If different from above, I want to remove any extra line breaks that occur WITHIN text too:
<a href="homepage.html">Contact
List</a>
I have tested some regexp's of my own (using tagged expressions and combinations of [:space:], [:blank:], and \n), but they are too sensitive and change more code than I want. I realize some of these things may require multiple passes. Thanks in advance for any help.
Ben
Correction to example number 1
Substitute *space* for "_"
1) Extra spaces that occur in the middle of tags or text strings, such as:
<a_____________href="homepage.html">Contact List</a>
<a href="homepage.html">Contact_____________List</a>
Ben
Can someone enhance this?
Silly to reply to myself, but the following FIND expression seems to get rid of the extra spaces:
[ \t][ \t]+
However, it tends to remove any pseudo-spaces used for tab indents. Any other ways to accomplish this?
Any suggestions for the other items above?
Ben
Re: Can someone enhance this?
use
\([^ \t]\)[ \t][ \t]+
and replace by
\1
i.e. \1 followed by a single space.
This will find one non-space/tab followed by more than one space or tab
and replace it with the found character plus a space
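For anyone scripting this instead of running it in TextPad, here is a minimal Python sketch of the same find/replace. The function name collapse_inner_runs is made up for illustration. One adjustment is an assumption on my part: Python's [^ \t] also matches a line break, so the class below excludes \r and \n as well, otherwise a run of indent spaces at the start of a line would be collapsed too.

```python
import re

# One character that is not a space, tab, or line break, followed by
# two or more spaces/tabs. Requiring that leading character keeps runs
# at the very start of a line (the indentation) out of the match.
INNER_RUN = re.compile(r"([^ \t\r\n])[ \t]{2,}")


def collapse_inner_runs(text: str) -> str:
    """Replace each matched run with the kept character plus one space."""
    return INNER_RUN.sub(r"\1 ", text)
```

Calling collapse_inner_runs on a line such as the corrected example 1 should leave one space after "<a" and between "Contact" and "List", while leading tabs or spaces on each line stay as they were.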
Re: Can someone enhance this?
To remove hard breaks imbedded in tags (#3 above) use:
Regexp: \(<[^>]*\)\n
Replace: \1_
"_" = a space. If the tag is broken across several lines (i.e. one parameter to a line or some such) you'll have to repeat, so "repeat until not found" works best.
#2 & #4 are harder, for two reasons: (a) you want to match only certain tags [doing it for <body> or the like would not be what you want, so there's the question of *which* -- so far you seem primarily interested in anchor tags] and (b) the matching criterion is a negative condition, which is not "intuitive" with regexps and usually needs a transform/change/de-transform sequence to work well. I'll give it some thought and reply in a bit.
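The "repeat until not found" step for #3 can be sketched in Python as a loop around the same expression. This is only an illustration under a couple of assumptions: the name join_tag_breaks is hypothetical, line endings are plain \n, and the class excludes \n so that each pass joins the first remaining break in a tag, roughly like a line-based editor would.

```python
import re

# "<" followed by anything that is neither ">" nor a line break,
# then a line break: i.e. a break that falls inside an unclosed tag.
TAG_BREAK = re.compile(r"(<[^>\n]*)\n")


def join_tag_breaks(html: str) -> str:
    """Replace line breaks inside tags with a space, repeating until none match."""
    while True:
        html, count = TAG_BREAK.subn(r"\1 ", html)
        if count == 0:
            return html
```

A tag split as one attribute per line needs one pass per break, which is why the loop runs until subn reports zero replacements. Breaks between tags or inside link text are not touched by this pattern.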
OK, you guys nailed 1 and 3 . . .
...and the regexps worked great so far. I see how 2 and 4 may be trickier. Let's see if I can help limit the constraints further for each.
2) Extra line breaks that occur BETWEEN different tags DIRECTLY FOLLOWING or WITHIN a set of anchor tags. Worst case scenario might be something like:
<a href="homepage.html">
<font face="arial">
<b>Contact List</b></font>
</a>
(where the occurrence of breaks is not necessarily consistent for a set of opening and closing tags).
4) Extra line breaks that occur WITHIN anchor tag link text ONLY:
<a href="homepage.html">Contact
List</a>
Ben
Just thought of one more roadblock
The regexp's you guys developed for cleaning up the extra tags in title attributes:
FIND: \(title="Link to [^<"]*\)<[^>]*>\([^"]*\)
REPLACE: \1\2
...won't find any matches that happen to include any non-breaking spaces (&nbsp;) in the FIND expression. This seems weird because the regexp that finds and generates the title attribute doesn't care about any non-breaking spaces, only the regexp I use afterward to clean it up.
Example:
<a href="homepage.html" title=" Contact List"> Contact List
</a>
Any idea as to what is tripping up Textpad in this instance? Thanks again for all the great help! I only know enough about regexp's to get myself in trouble, so I must humbly yield to the masters.
Ben
Correction...
Substitute "nbsp" for the "XXXX" characters in the example below:
<a href="homepage.html" title="&XXXX;&XXXX;Contact List">&XXXX;&XXXX;Contact List
</a>
Ben
Re: Correction...
Sorry it took so long to get back to you on this, life's been ... interesting.
For the last question first: an "nbsp" is not an actual space until interpreted by a browser for display; to a regexp it's just a string of characters. As part of your title attribute they're irrelevant, so you should convert them to real spaces:
regexp: \(<a [^>]*title="[^"]*\)\(&_nbsp_;\)+
replace: \1_
where: &_nbsp_; has the "_" to keep the forum from actually converting to a space character, and "_" in the replace is a trailing space.
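Here is a rough Python sketch of that title clean-up, for anyone running it outside TextPad. The function name title_nbsp_to_space is made up, and two details are my own assumptions rather than part of the expression above: the [^"]* is made lazy so a whole run of entities becomes one space, and the replace is looped so that several separate runs in one title are all caught.

```python
import re

# An opening <a ...> tag up to and including part of its title value,
# followed by one or more literal "&nbsp;" entities.
TITLE_NBSP = re.compile(r'(<a [^>]*title="[^"]*?)(?:&nbsp;)+')


def title_nbsp_to_space(html: str) -> str:
    """Turn each run of &nbsp; inside an anchor's title attribute into one space."""
    while True:
        html, count = TITLE_NBSP.subn(r"\1 ", html)
        if count == 0:
            return html
```

Only entities inside the title value are converted; any &nbsp; in the link text after the opening tag is left alone.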
I think we can deal with 2 & 4 as being basically the same problem: remove all line breaks between any anchor [a] tags and corresponding end anchor [/a] tags ... to do that first you need to "simplify" the end-anchors temporarily:
regexp: </a>
replace: \x09
now, repeat the following until it fails to match anything:
regexp: \(<a [^\x09]*\)\n\(.\)
replace: \1 \2
and you're almost done *unless* there are anchor tags imbedded in other anchor tags [a bad idea, but not impossible or without some uses] that contain line breaks after the inside tag closes -- *that* could be nearly impossible to fix automatically...
find: <a [^\x09]*<a
to see if you've got any of those, and fix them yourself. There certainly shouldn't be many; if there are, it's possible to convert them to a side-by-side layout, but the regexp for that is messier than it's worth if you don't have to.
now you'll have to reverse the simplification:
regexp: \x09
replace: </a>
And that's it.
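The whole simplify / join / restore sequence can be sketched in Python roughly as follows. Treat it as an illustration, not a drop-in tool. The function names are hypothetical. I have swapped the placeholder from \x09 (a tab) to \x01, on the assumption that the files use real tabs for indentation and that \x01 never occurs in them. I have also dropped the trailing \(.\) group for simplicity, and the nested-anchor check looks for "<a " with a trailing space.

```python
import re

SENTINEL = "\x01"  # assumption: this character never appears in the HTML
_S = re.escape(SENTINEL)

# "<a " followed by anything except the sentinel or a line break, then a
# line break: a break that occurs before the anchor's closing tag.
ANCHOR_BREAK = re.compile(r"(<a [^" + _S + r"\n]*)\n")
# A second "<a " reached before any closing tag: a nested anchor.
NESTED_ANCHOR = re.compile(r"<a [^" + _S + r"]*<a ")


def has_nested_anchor(html: str) -> bool:
    """Report whether an anchor opens inside another, unclosed anchor."""
    return NESTED_ANCHOR.search(html.replace("</a>", SENTINEL)) is not None


def join_anchor_breaks(html: str) -> str:
    """Replace line breaks between <a ...> and its </a> with spaces."""
    if SENTINEL in html:
        raise ValueError("sentinel character already present in input")
    text = html.replace("</a>", SENTINEL)    # simplify the end-anchors
    while True:                              # repeat until nothing matches
        text, count = ANCHOR_BREAK.subn(r"\1 ", text)
        if count == 0:
            break
    return text.replace(SENTINEL, "</a>")    # reverse the simplification
```

Running has_nested_anchor first gives the same warning as the manual find step, so nested anchors can be fixed by hand before join_anchor_breaks is applied. Line breaks outside anchors are left as they are.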
Thanks everyone!
I will be exercising Textpad's FIND/REPLACE capabilities a lot this weekend, and I'll post again if I'm successful or have more questions! You guys are life-savers!
Ben