Need help with Find/Replace Regular Expressions

Ben · Post by **Ben** » Fri Apr 27, 2001 3:08 am

Hi, I hope someone here can help me. I am attempting to save some time by creating a global find/replace procedure that will automatically add "title" attributes to every HTML hyperlink tag in a page/site. Obviously, each link is different and I want to take advantage of the tagged expressions to "remember" the specific URLs, etc.

The link structure presently looks like this:

<a href="table.html" target="_top" class="main">Application Information</a>

I want it to look like this:

<a href="table.html" target="_top" class="main" title="Link to Application Information page">Application Information</a>

... where 'title="Link to XXXX page" ' is the part I want automatically generated by the link text.

I have worked out the following find/replace expressions:

FIND: <a href="\(.*\)">\(.*\)</a>
REPLACE: <a href="\1" title="Link to \2 page">\2</a>

So far it works fine, but it also copies any formatting tags, such as boldface (<b> or <strong>), that occur between the <a> and </a> tags. This is my first problem: how to strip the extra tags from the "\2" expression I use in the title attribute.

My second major problem is that the wildcard ".*" always finds the largest matching expression on a line, so if 2 or more links occur on the same line, it finds and highlights the whole batch of links and treats it as one matching object!

One of my more rigorous examples is:

<td width="124" align="left" valign="middle">
<a href="http://root/directory/Main_Homepage.htm" target="_top" class="red_no">Home</a></td></tr><tr><td width="8" align="left"> </td><td width="132" align="left" colspan="2"><a href="Applications.html" target="_top" class="black_no">Applications</a></td>

(Yes I know the code is messy, but someone else developed it in FrontPage and it's not my job to fix it all.) Can anyone help with the find/replace part of my problem? Any handy macros already out there?

Thanks in advance,
Ben

Andreas · Post by **Andreas** » Fri Apr 27, 2001 6:39 am

To avoid catching 2 or more links in one replace, you need a character class:

<a href="\([^"]*\)">\(.*\)</a>

[abc] matches a or b or c
[^abc] matches any character except a, b and c.
so
[^"] matches any character except "
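For anyone who wants to see the difference outside TextPad, here is a minimal sketch in Python. Python's re module is an assumption of this example, not something used in this thread, and it writes tagged expressions as plain parentheses. The [^<]* for the link text is also an addition for illustration; it will not match links that contain embedded tags.

```python
import re

# Two links on one line, as in the FrontPage-generated markup.
line = '<a href="a.html">One</a> <a href="b.html">Two</a>'

# Greedy: .* runs as far as it can, so both links collapse into one match.
greedy = re.findall(r'<a href="(.*)">(.*)</a>', line)

# Negated classes: [^"]* cannot cross a quote, [^<]* cannot cross a tag,
# so each link is matched on its own.
bounded = re.findall(r'<a href="([^"]*)">([^<]*)</a>', line)

print(len(greedy))   # number of matches with the greedy pattern
print(bounded)       # one (href, text) pair per link
```

The greedy pattern reports a single match for the whole line, while the bounded one reports each link separately.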

the second problem (html within the anchor) is more difficult.
I don't think this can be done in one go.

After the first replace, try the following
Find
title="\([^"]*\)<[^>"]*>\([^"]*\)"
and replace it with
title="\1\2"

This finds
title="
followed by zero or more non-quotes
<
followed by zero or more non-> and non-quotes
>
followed by zero or more non-quotes
"

This will take out one tag at a time,
so repeat the replacement till there are no more matches.
title-attributes without tags inside won't get changed by this.

But be careful, this will affect every title-attribute, not only those in anchor tags - extend the expressions if you need only those title attributes within anchor tags
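The "repeat until there are no more matches" step can be sketched as a loop. This is a rough Python illustration, not TextPad syntax: Python's re module writes tagged expressions as plain parentheses, and the function name strip_tags_in_titles and the sample string are made up for the example.

```python
import re

# Python spelling of the find expression described above.
TAG_IN_TITLE = re.compile(r'title="([^"]*)<[^>"]*>([^"]*)"')

def strip_tags_in_titles(html: str) -> str:
    """Remove one tag per title attribute per pass until nothing matches."""
    while True:
        html, count = TAG_IN_TITLE.subn(r'title="\1\2"', html)
        if count == 0:
            return html

sample = '<a href="x.html" title="Link to <b>News</b> page"><b>News</b></a>'
result = strip_tags_in_titles(sample)
print(result)
```

Because [^>"]* excludes quotes, a tag that carries quoted attributes of its own will not be matched by this expression.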

Andreas

Ben · Post by **Ben** » Fri Apr 27, 2001 1:00 pm

The FIND expression you specified for my first problem:

<a href="\([^"]*\)">\(.*\)</a>

... does not seem to work for me (it actually can't find any matching regular expressions for the above parameters). Plus I'm not sure if constraining my match to the quotation mark will solve the problem because doesn't the asterisk parameter still force TextPad to find the largest possible match on a line (even if you have 3 individual matches on that same line)?

Ben

Randall McDougall · Post by **Randall McDougall** » Sat Apr 28, 2001 7:02 am

The reason that expression couldn't match is because your anchor tag is already more complex (according to your example) than it allows for. The [^"]* will work as advertised to limit the scope of the first part, but you're right: the .* that remains is still a problem, and not so easily dealt with ... however, since removing the imbedded tags is a multi-part process, I don't see any reason to try to do the impossible ^_^ ... First move all the anchor end tags to new lines:

regexp: </a>
replace: \n\0

(afterward you can replace all "\n</a>" with just "</a>" again if you like -- though it won't matter for the display)

After that you'd want to use:

regexp: <\(a href=[^>]+\)>\(.*\)$
replace: <\1 title="\2">\2

and proceed to remove any imbedded tags as suggested above ...
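The same three steps can be tried outside TextPad. The following is only a sketch in Python (an assumption of this example; Python writes tagged expressions as plain parentheses and needs re.MULTILINE for $ to match at each line end), and the sample markup is invented.

```python
import re

html = ('<td><a href="a.html" class="x">One</a></td>'
        '<td><a href="b.html">Two</a></td>')

# Step 1: move every closing anchor tag to the start of a new line.
split = html.replace('</a>', '\n</a>')

# Step 2: on each line, tag the opening anchor and the text that runs
# to the end of the line, then copy that text into a title attribute.
tagged = re.sub(r'<(a href=[^>]+)>(.*)$', r'<\1 title="\2">\2',
                split, flags=re.MULTILINE)

# Step 3 (optional): put the closing tags back on the same line.
result = tagged.replace('\n</a>', '</a>')
print(result)
```

Since each link text now ends at a line break, the greedy .* can no longer run on into the next link.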

Ben · Post by **Ben** » Mon Apr 30, 2001 6:03 pm

The expressions:

FIND: title="\([^"]*\)<[^>"]*>\([^"]*\)"
REPLACE: title="\1\2"

are not working for me. In theory, it looks like it should find all the extra tags that occur in the newly created title attributes. But it doesn't find anything in any of my HTML pages, so I must be missing something. Another problem I think I see with this expression is that it will trip on the title attributes that happen to include quotation marks from within the other tags I am trying to remove. For instance, when using Randall's expressions:

<a href="link.html"><img src="bullet.gif" alt="bullet graphic">Contact List</a>

becomes --->

<a href="link.html" title="Link to <img src="bullet.gif" alt="bullet graphic">Contact List"><img src="bullet.gif">Contact List</a>

What part of the expression do I need to change to correctly remove all of the extra tags? Right now I cannot find any matches for the above expression, so I can only test parts of it. Or is there another expression that might work better?

Ultimately, I want to end up with this:

<a href="link.html" title="Link to Contact List"><img src="bullet.gif">Contact List</a>

Thanks again for all of the help on this problem.

Ben

Ben · Post by **Ben** » Mon Apr 30, 2001 6:05 pm

the bullet graphic images were supposed to be examples that read like this (substitute "<>" for "[]")

<a href="link.html" title="Link to

Ben · Post by **Ben** » Tue May 01, 2001 3:36 pm

Can I modify the FIND statement to capture those matches that happen to have accidental hard returns in the middle of tags? Of course I cannot predict where an accidental return would occur in a tag (I see them sprinkled everywhere).

Or would it be better to remove all hard returns from the HTML document, and then use the appropriate FIND expressions?

Ben

Randall McDougall · Post by **Randall McDougall** » Tue May 01, 2001 6:08 pm

Regexp: \(title="Link to [^<"]*\)<[^>]*>\([^"]*\)
Replace: \1\2

Should do it for the original question (but will have to be repeated until it fails of course) ... as for hard breaks, run this first:

Regexp: \(<[^>]*\)\n
Replace: \1

(*NOTE* that there's a *space* after the \1 in the replace)
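Both passes can be checked against the "Contact List" example from earlier in the thread. This is a hedged Python sketch, not TextPad syntax: Python writes tagged expressions as plain parentheses, the helper names repeat_sub and clean are hypothetical, and \n is added to the first character class because Python's [^>] would otherwise match across line breaks.

```python
import re

# A line break that falls inside an unfinished tag.
BREAK_IN_TAG = re.compile(r'(<[^>\n]*)\n')
# One embedded tag inside a title attribute that starts with "Link to ".
TAG_IN_TITLE = re.compile(r'(title="Link to [^<"]*)<[^>]*>([^"]*)')

def repeat_sub(pattern, replacement, text):
    """Apply the replacement repeatedly until it no longer matches."""
    while True:
        text, count = pattern.subn(replacement, text)
        if count == 0:
            return text

def clean(html):
    html = repeat_sub(BREAK_IN_TAG, r'\1 ', html)  # note the space after \1
    return repeat_sub(TAG_IN_TITLE, r'\1\2', html)

sample = ('<a href="link.html"\n'
          'title="Link to <img src="bullet.gif" alt="bullet graphic">Contact List">'
          '<img src="bullet.gif">Contact List</a>')
result = clean(sample)
print(result)
```

Unlike the earlier expression, [^>]* here allows quotes inside the embedded tag, which is what lets it remove the img tag with its quoted attributes.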
