Containing 5,000 characters-or-longer without a space or tab, cut it (to be pasted into a new file for backup) and do a Save As on the original document.
Thanks friends.
Find an unbroken string of anything . . .
Trump.
Jesus wept.
Find (checking Regular expression):
[^ \t]{5000,}
(there is a space between ^ and \)
Then cut (Ctrl-X), new file (Ctrl-N), Paste (Ctrl-V), Save (Ctrl-S).
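For anyone who would rather script this step than do it by hand in the editor, here is a minimal Python sketch of the same find-and-cut idea. The function name split_long_runs and the min_len parameter are made up for illustration. One assumption to flag: in Python the pattern [^ \t] would also run across line breaks, so this version adds \r and \n to the excluded characters.

```python
import re

def split_long_runs(text, min_len=5000):
    """Return (remaining_text, cut_runs).

    cut_runs holds every unbroken run of at least min_len characters that
    contains no space, tab, or line break; remaining_text is the input with
    those runs removed.
    """
    # Assumption: line breaks also end a run. The pattern from the thread,
    # [^ \t], would let a run continue across newlines in Python.
    pattern = re.compile(r"[^ \t\r\n]{%d,}" % min_len)
    cut_runs = pattern.findall(text)
    remaining = pattern.sub("", text)
    return remaining, cut_runs
```

The cut runs can then be written to a backup file and the remaining text saved under a new name, which mirrors the cut / new file / paste / save sequence.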
MudGuard wrote: Find (checking Regular expression): . . .
Thank you MudGuard, as usual you're AWESOME! In fact you may have suggestions I haven't even thought of, so to that end I'm supplying some context:
This project grew out of the HTML code of a web page on the site DigitalTrends, although I could easily find millions more just like it. There are 2,193,480 characters in this one page's code (now you know why I can't paste it here in the forums!). The HTML file is so huge that the mere act of opening it in Dreamweaver crashes the program. Over 98% of the code is related to monetizing the site, and it is this advertising bloat code (in this one page and others like it) that I hope to DELETE automatically. As will soon be apparent, an all-purpose "style stripper" is too arbitrary.
I'm giving you the link to a .ZIP file I just uploaded to EXPIREBOX, a FREE online temporary storage site. You'll need to download it quickly because the link expires in 48 hours. The .ZIP file contains three files: 1) a PDF graphics reference, 2) an HTML file, and 3) a text version of the HTML file. I strongly recommend that you open the text version of the page first. I've changed nothing in the code; the page I downloaded is exactly as you see it, warts and all.
THE PROJECT
I'd like to have the means at my disposal to automate the removal of all advertising and related code, and events (such as JavaScript, PHP etc.), while keeping the page's essential visual look and style. In this code are hundreds of script and style entries related exclusively to monetizing the site: class and ID selectors with labels such as
#ad~
.Advert IframeAd~
#TopRightRadvert~
#google_ads_~
#showcase_links_~
div#promo,
#BBCPH_MCPH_MCPH_P_ArticleAd1,
id="dt-video-container-2989522218"
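As a starting point for spotting labels like these, here is a rough Python sketch that lists the id and class names in a page that look ad-related. The pattern AD_NAME and the function ad_like_names are hypothetical and built only from the examples above, so this is a heuristic, not a complete list. It will miss names that give no hint in their spelling, such as dt-video-container-2989522218.

```python
import re

# Hypothetical heuristic derived from the example labels in this thread.
AD_NAME = re.compile(
    r"(?i:advert|google_ads|showcase_links|promo)"  # substrings seen in the examples
    r"|(?<![A-Za-z])(?i:ads?)(?![A-Za-z])"          # 'ad' or 'ads' on its own, e.g. ad, ad-1
    r"|(?<=[a-z])Ads?(?![a-z])"                     # camel-case 'Ad', e.g. IframeAd, ArticleAd1
)

# Matches id="..." or class="..." (double-quoted values only, an assumption).
ATTR = re.compile(r'(?<![\w-])(?:id|class)\s*=\s*"([^"]*)"')

def ad_like_names(html):
    """Return the id/class tokens in html whose names look ad-related."""
    hits = []
    for attr_value in ATTR.findall(html):
        # A class attribute may hold several space-separated names.
        for token in attr_value.split():
            if AD_NAME.search(token):
                hits.append(token)
    return hits
```

The letter-boundary checks are there so that ordinary names such as header or download-button are not flagged just because they contain the letters "ad".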
On Line 64 you'll see an uninterrupted 56,947-character block of code, and this block was the catalyst for my thread. Solid blocks of data such as these, with no spaces or carriage returns, are a matter of seconds to delete. In fact, the monetizing of virtually all web pages uses two methods, and always includes Google:
- iframes
- scripts
Iframes and scripts can be removed in seconds (at least I hope so!); it's the selector tags that introduce the greatest challenge. This will be an ongoing project that will need to be edited and perfected over time as the web evolves.
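To show what removing iframes and scripts in one pass might look like, here is a minimal Python sketch. The function name strip_scripts_and_iframes is made up for illustration. It assumes every script and iframe has a matching closing tag, and a regular expression is a blunt tool for HTML, so treat it as a first pass on a backup copy and not a finished cleaner.

```python
import re

# Matches <script ...>...</script> and <iframe ...>...</iframe>, in any
# letter case and across line breaks. Assumption: each element has a
# matching closing tag; a self-closed or unclosed tag is left untouched.
SCRIPT_OR_IFRAME = re.compile(
    r"<(script|iframe)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)

def strip_scripts_and_iframes(html):
    """Return html with every script and iframe element removed."""
    return SCRIPT_OR_IFRAME.sub("", html)
```

This does nothing about the ad-related selectors and the markup they style, which is the harder part of the job.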
So why am I doing this? Sometimes I like to download technical guides and store them, along with their graphics, for my own personal offline use as a reference I can annotate. I want to preserve each site's visual style because their respective page designs help me to associate them in my mind. I'll post my own progress here, but for MudGuard and others who are interested: Download my .ZIP file (in the next 48 hours) and let's play.