Find an unbroken string of anything . . .

Post by **no.cache** » Wed Apr 12, 2017 10:43 pm

Specifically, one containing 5,000 or more characters without a space or tab. I want to cut it (to be pasted into a new file for backup) and then do a Save As on the original document.

Thanks friends.

Post by **MudGuard** » Thu Apr 13, 2017 7:54 am

Find (checking Regular expression):


[^ \t]{5000,}

(there is a space between ^ and \)

Then cut (Ctrl-X), new file (Ctrl-N), Paste (Ctrl-V), Save (Ctrl-S).
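
For anyone who would rather script these steps than repeat them by hand, here is a minimal sketch in Python. It reuses the same expression; the function names and the file paths passed to `cut_to_backup` are hypothetical. Note one assumption: in Python the class `[^ \t]` also matches line breaks, so a run can span several lines. Add `\r\n` inside the brackets if runs should stop at the end of a line.

```python
import re
from pathlib import Path

# Same expression as above: 5,000 or more characters with no space or tab.
LONG_RUN = re.compile(r"[^ \t]{5000,}")


def cut_long_runs(text, pattern=LONG_RUN):
    """Return (text with the long runs removed, list of the removed runs)."""
    removed = pattern.findall(text)
    remaining = pattern.sub("", text)
    return remaining, removed


def cut_to_backup(source, backup, output):
    """Cut long runs from `source`, save them to `backup`, and write the
    trimmed text to `output` (the scripted equivalent of Save As)."""
    text = Path(source).read_text(encoding="utf-8", errors="replace")
    remaining, removed = cut_long_runs(text)
    Path(backup).write_text("\n".join(removed), encoding="utf-8")
    Path(output).write_text(remaining, encoding="utf-8")
    return len(removed)
```

Calling something like `cut_to_backup("page.txt", "page-cut.txt", "page-trimmed.txt")` leaves the original file untouched, so nothing is lost if the expression catches more than intended.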

Post by **no.cache** » Sat Apr 15, 2017 9:26 pm

MudGuard wrote: Find (checking Regular expression): . . .

Thank you MudGuard, as usual you're AWESOME! In fact, you may have suggestions I haven't even thought of, so to that end I'm supplying some context:

This project can be traced to the HTML code of a web page on the site DigitalTrends, although I could easily find millions more just like it. There are 2,193,480 characters in this one page's code (now you know why I can't paste it here in the forums!). The HTML file is so huge that the mere act of opening it in Dreamweaver crashes the program. Over 98% of the code is related to monetizing the site, and it is this advertising bloat, in this one page and others like it, that I hope to find an automated way to DELETE. As will soon be apparent, an all-purpose "style stripper" is too arbitrary.

I'm giving you the link to a .ZIP file I just uploaded to EXPIREBOX, a FREE online temporary storage site. You'll need to download it quickly because the link expires in 48 hours. The .ZIP file contains three files: 1) a PDF graphics reference, 2) an HTML file, and 3) a text version of the HTML file. I strongly recommend that you open the text version of the page first. I've changed nothing in the code; the page I downloaded is exactly as you see it, warts and all.

THE PROJECT
I'd like to have the means at my disposal to automate the removal of all advertising and related code, and events (such as JavaScript, PHP, etc.), while keeping the page's essential visual look and style. In this code are hundreds of scripts and style rules related exclusively to monetizing the site, with class and ID selectors carrying labels such as:

#ad~
.Advert IframeAd~
#TopRightRadvert~
#google_ads_~
#showcase_links_~
div#promo,
#BBCPH_MCPH_MCPH_P_ArticleAd1,
id="dt-video-container-2989522218"

I do not expect any search & replace session to catch all of them. Indeed, I have to be careful that the selectors matched aren't also (or exclusively) related to the page's content, so for obvious reasons I want to make sure that the data I've cut is backed up in its own file. As MudGuard's example above indicates, this is one of those projects where even the most artful TextPad macro cannot escape the fact that it is only one of multiple steps.
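
Because a match might belong to real content, one cautious first step is to list the candidates for review before deleting anything. The following is only a sketch using Python's standard `html.parser`; the pattern in `AD_HINTS` is a hypothetical starting point loosely based on the labels quoted above, not a complete ad filter, and the class and function names are my own.

```python
import re
from html.parser import HTMLParser

# Hypothetical hints, loosely based on the selector labels quoted above.
# "^ad(?![a-z])" accepts "ad-top" or "ad_1" but skips words like "address".
AD_HINTS = re.compile(
    r"^ad(?![a-z])|advert|iframead|google_ads_|showcase_links_|^promo$",
    re.IGNORECASE,
)


class SelectorCollector(HTMLParser):
    """Collect id/class values that look ad-related, for manual review."""

    def __init__(self):
        super().__init__()
        self.candidates = []  # (line number, tag, attribute, value)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("id", "class") or not value:
                continue
            # A class attribute may hold several space-separated names.
            for token in value.split():
                if AD_HINTS.search(token):
                    line, _column = self.getpos()
                    self.candidates.append((line, tag, name, token))


def find_ad_candidates(html_text):
    collector = SelectorCollector()
    collector.feed(html_text)
    collector.close()
    return collector.candidates
```

The function only reports; nothing is removed. The line numbers in the result make it easy to jump to each candidate in TextPad and decide by eye whether it is advertising or content.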

On Line 64 you'll see an uninterrupted 56,947-character block of code, and this block was the catalyst for my thread. Solid blocks of data such as these, with no spaces or carriage returns, take only seconds to delete. In fact, the monetizing of virtually all web pages uses two methods and always includes Google:

iframes
scripts
Google

(and Amazon if necessary).

Iframes and scripts can be removed in seconds (at least I hope so!); it's the selector tags that pose the greatest challenge. This will be an ongoing project that will need to be edited and perfected over time as the web evolves.
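
As a rough illustration of the iframe and script step, here is a small Python sketch that cuts every paired `<script>` and `<iframe>` block and keeps the cut pieces so they can be backed up. It rests on assumptions: a regular expression is not a real HTML parser, so a literal `</script>` inside a JavaScript string would end a match early, and unclosed tags are left alone. The names `BLOCK` and `strip_blocks` are hypothetical.

```python
import re

# Matches <script ...>...</script> and <iframe ...>...</iframe>, any case,
# across line breaks. The backreference \1 pairs each opener with its closer.
BLOCK = re.compile(
    r"<(script|iframe)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_blocks(html_text):
    """Return (html without script/iframe blocks, list of removed blocks)."""
    removed = []

    def _cut(match):
        removed.append(match.group(0))  # keep the cut text for the backup file
        return ""

    cleaned = BLOCK.sub(_cut, html_text)
    return cleaned, removed
```

Writing the `removed` list to its own file gives the same safety net as the manual cut-and-paste routine: anything taken out by mistake can be pasted back.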

So why am I doing this? Sometimes I like to download technical guides and store them, along with their graphics, for my own personal offline use as a reference I can annotate. I want to preserve each site's visual style because their respective page designs help me to associate them in my mind. I'll post my own progress here, but for MudGuard and others who are interested: download my .ZIP file (in the next 48 hours) and let's play.