How to remove extra spaces, gaps, line breaks, etc.

Ben · Post by **Ben** » Tue May 01, 2001 8:33 pm

Randall and Andreas' regexp's work beautifully on correctly developed HTML. Unfortunately they are stumbling on the poorly coded HTML that someone else developed in FrontPage. There are extra line breaks and spaces everywhere, prematurely reducing the number of matches found. I know how to remove every single line break and gap in a file, but then it becomes hard to read without a few logically placed line breaks and tab indents. I'm looking for a more accurate cure that limits the collateral damage I cause to the surrounding code. Can anyone help me with regexp's that get rid of the following:

1) Extra spaces that occur in the middle of tags or text strings, such as:

<a href="homepage.html">Contact List</a>
<a href="homepage.html">Contact List</a>

2) Extra line breaks that occur BETWEEN different tags (but I don't want to remove every line break in the page, and I don't want to disturb the existing tab structure if possible):

<a href="homepage.html">
Contact List</a>

3) Extra line breaks that occur WITHIN tags:

<a
href="homepage.html">Contact List</a>

4) If different from above, I want to remove any extra line breaks that occur WITHIN text too:

<a href="homepage.html">Contact
List</a>

I have tested some regexp's of my own (using tagged expressions and combinations of [:space:], [:blank:], and \n), but they are too sensitive and change more code than I want. I realize some of these things may require multiple passes. Thanks in advance for any help.

Ben

Ben · Post by **Ben** » Tue May 01, 2001 8:35 pm

Substitute *space* for "_"

1) Extra spaces that occur in the middle of tags or text strings, such as:

<a_____________href="homepage.html">Contact List</a>
<a href="homepage.html">Contact_____________List</a>

Ben

Ben · Post by **Ben** » Tue May 01, 2001 11:13 pm

Silly to reply to myself, but the following FIND expression seems to get rid of the extra spaces:

[ \t][ \t]+

However, it tends to remove any pseudo-spaces used for tab indents. Any other ways to accomplish this?

Any suggestions for the other items above?

Ben

Andreas · Post by **Andreas** » Wed May 02, 2001 8:43 am

Use
\([^ \t]\)[ \t][ \t]+
and replace with
\1
i.e. \1 followed by a space.
This finds one non-space/non-tab character followed by two or more spaces or tabs,
and replaces the match with the found character plus a single space.
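
For anyone doing the same cleanup outside TextPad, here is a minimal Python sketch of the replacement above. The function name `collapse_inner_spaces` is made up for illustration. Excluding `\r` and `\n` from the leading character class is an adaptation, not part of the original expression: in Python, `[^ \t]` also matches a line break, which would let the pattern swallow the indent of the following line.

```python
import re

# One visible (non-blank, non-line-break) character followed by two or more
# spaces/tabs. The leading character is captured so it can be put back.
_INNER_RUN = re.compile(r"([^ \t\r\n])[ \t]{2,}")

def collapse_inner_spaces(text: str) -> str:
    """Replace each run of 2+ spaces/tabs that follows a visible character with one space."""
    return _INNER_RUN.sub(r"\1 ", text)

sample = '\t\t<a     href="homepage.html">Contact     List</a>'
print(collapse_inner_spaces(sample))
```

Because the run has to follow a visible character, whitespace at the very start of a line (tab or space indents) is not matched.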

Randall McDougall · Post by **Randall McDougall** » Wed May 02, 2001 12:41 pm

To remove hard breaks imbedded in tags (#3 above) use:

Regexp: \(<[^>]*\)\n
Replace: \1_

"_" = a space. If the tag is broken across several lines (i.e. one parameter to a line or some such), you'll have to repeat the replacement, so repeating until nothing is found works best.

#2 & #4 are harder ... for two reasons: (a) you want to match only certain tags [doing <body> or would not be what you want, so there's the question of *which* -- so far you're primarily interested in anchor tags it seems] and (b) the matching criteria is for a negative condition which is not "intuitive" with regexps and usually needs some transformation/change/de-transform to work well ... I'll give it some thought and reply in a bit.
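
The "repeat until not found" replacement for #3 above can be sketched in Python roughly as follows. The function name `join_broken_tags` is hypothetical, and the sketch assumes `\n` line endings. Like the original expression, it treats every `<` as the start of a tag, so a stray `<` in text or script would also be affected.

```python
import re

# A "<", then any run of characters other than ">", then a line break:
# in other words, a line break inside a tag that has not been closed yet.
_BREAK_IN_TAG = re.compile(r"(<[^>]*)\n")

def join_broken_tags(text: str) -> str:
    """Replace line breaks inside tags with a space, repeating until none are left."""
    while True:
        text, count = _BREAK_IN_TAG.subn(r"\1 ", text)
        if count == 0:  # nothing matched on this pass, so we are done
            return text

print(join_broken_tags('<a\nhref="homepage.html">Contact List</a>'))
```

Each pass removes at least one line break, so the loop always terminates. Line breaks between tags are left alone because the match cannot cross a `>`.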

Ben · Post by **Ben** » Wed May 02, 2001 1:31 pm

...and the regexps worked great so far. I see how 2 and 4 may be trickier. Let's see if I can help limit the constraints further for each.

2) Extra line breaks that occur BETWEEN different tags DIRECTLY FOLLOWING or WITHIN a set of anchor tags. Worst case scenario might be something like:

<a href="homepage.html">

Contact List
</a>

(where the occurrence of breaks is not necessarily consistent for a set of opening and closing tags).

4) Extra line breaks that occur WITHIN anchor tag link text ONLY:

<a href="homepage.html">Contact
List</a>

Ben

Ben · Post by **Ben** » Wed May 02, 2001 1:48 pm

The regexp's you guys developed for cleaning up the extra tags in title attributes:

FIND: \(title="Link to [^<"]*\)<[^>]*>\([^"]*\)
REPLACE: \1\2

...won't find any matches that happen to include any non-breaking spaces (&nbsp;) in the text matched by the FIND expression. This seems weird because the regexp that finds and generates the title attribute doesn't care about any non-breaking spaces, only the regexp I use afterward to clean it up.

Example:

<a href="homepage.html" title=" Contact List"> Contact List
</a>

Any idea as to what is tripping up Textpad in this instance? Thanks again for all the great help! I only know enough about regexp's to get myself in trouble, so I must humbly yield to the masters.

Ben

Ben · Post by **Ben** » Wed May 02, 2001 1:51 pm

Substitute "nbsp" for the "XXXX" characters in the example below:

<a href="homepage.html" title="&XXXX;&XXXX;Contact List">&XXXX;&XXXX;Contact List
</a>

Ben

Randall McDougall · Post by **Randall McDougall** » Fri May 04, 2001 9:31 am

Sorry it took so long to get back to you on this, life's been ... interesting.

For the last question first: an "nbsp" is not an actual space until it is interpreted by a browser for display; to a regexp it's just a string of characters. As part of your title attribute they're irrelevant, so you should convert them to real spaces:

regexp: \(<a [^>]*title="[^"]*\)\(&_nbsp_;\)+
replace: \1_

where: &_nbsp_; has the "_" to keep the forum from actually converting to a space character, and "_" in the replace is a trailing space.
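
As a rough Python equivalent of the title cleanup above (a sketch, not a drop-in tool): the function name `nbsp_to_space_in_titles` is invented for illustration, and the lazy `[^"]*?` is an adaptation so that a whole run of entities collapses to one space in a single pass. The loop handles titles that contain several separate runs.

```python
import re

# Inside an "<a ...>" tag: capture everything up to the start of the title
# value plus the shortest stretch of title text, then match one or more
# consecutive &nbsp; entities.
_NBSP_IN_TITLE = re.compile(r'(<a [^>]*title="[^"]*?)(?:&nbsp;)+')

def nbsp_to_space_in_titles(text: str) -> str:
    """Turn each run of &nbsp; inside an anchor's title attribute into one real space."""
    while True:
        text, count = _NBSP_IN_TITLE.subn(r"\1 ", text)
        if count == 0:
            return text

src = '<a href="homepage.html" title="&nbsp;&nbsp;Contact List">&nbsp;&nbsp;Contact List\n</a>'
print(nbsp_to_space_in_titles(src))
```

Only the attribute value is touched; the entities in the link text after the `>` stay as they are.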

I think we can deal with 2 & 4 as being basically the same problem: remove all line breaks between any anchor [a] tags and corresponding end anchor [/a] tags ... to do that first you need to "simplify" the end-anchors temporarily:

regexp: </a>
replace: \x09

now, repeat the following until it fails to match anything:

regexp: \(<a [^\x09]*\)\n\(.\)
replace: \1 \2

and you're almost done *unless* there are anchor tags imbedded in other anchor tags [a bad idea, but not impossible or without some uses] that contain line breaks after the inside tag closes -- *that* could be nearly impossible to fix automatically...

find: <a [^\x09]*<a

to see if you've got any of those, and fix them yourself; there certainly shouldn't be many. (If there are, it's possible to convert them to a side-by-side layout, but the regexp for that is messier than it's worth if you don't have to.)

now you'll have to reverse the simplification:

regexp: \x09
replace: </a>

And that's it.
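
The three steps above (simplify the end-anchors, join lines repeatedly, reverse the simplification) could be scripted along these lines. This is a sketch under stated assumptions: the names `join_anchor_lines` and `has_nested_anchor` are hypothetical, and the placeholder is `\x01` rather than `\x09`, on the assumption that a file indented with tabs already contains `\x09` and would clash with it.

```python
import re

# Assumption: this control character never occurs in the input, so it can
# safely stand in for "</a>" while the lines are being joined.
PLACEHOLDER = "\x01"

# "<a ", any run of non-placeholder characters, a line break, one more character.
_BREAK_IN_ANCHOR = re.compile(r"(<a [^\x01]*)\n(.)")
# An "<a " that reaches another "<a" before any placeholder: a nested anchor.
_NESTED_ANCHOR = re.compile(r"<a [^\x01]*<a")

def has_nested_anchor(html: str) -> bool:
    """Report whether an anchor appears to open inside another anchor."""
    return _NESTED_ANCHOR.search(html.replace("</a>", PLACEHOLDER)) is not None

def join_anchor_lines(html: str) -> str:
    """Replace line breaks between <a ...> and its </a> with spaces."""
    if PLACEHOLDER in html:
        raise ValueError("input already contains the placeholder character")
    text = html.replace("</a>", PLACEHOLDER)   # 1. simplify the end-anchors
    while True:                                # 2. repeat until nothing matches
        text, count = _BREAK_IN_ANCHOR.subn(r"\1 \2", text)
        if count == 0:
            break
    return text.replace(PLACEHOLDER, "</a>")   # 3. reverse the simplification

print(join_anchor_lines('<a href="homepage.html">Contact\nList</a>'))
```

Consecutive line breaks each become a space, so the result can contain double spaces; the space-collapsing expression from earlier in the thread cleans those up. As in the original expressions, `<a ` needs a space after the `a`, so tags broken right after `<a` have to be joined first.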

Ben · Post by **Ben** » Fri May 04, 2001 1:02 pm

I will be exercising Textpad's FIND/REPLACE capabilities a lot this weekend, and I'll post again if I'm successful or have more questions! You guys are life-savers!

Ben
