Convert syntax colors to HTML

ramonsky · Post by **ramonsky** » Mon Nov 17, 2003 11:22 am

The title probably sounds a bit strange, so I'll explain what I need in more detail. Suppose I write an illustrative piece of C++ code using TextPad. The syntax highlighting is very helpful. So far so good.

Now suppose I want to copy this illustrative code, COMPLETE WITH SYNTAX HIGHLIGHTING, into a web page. It's easy enough to copy and paste a few lines of code, and I can even add <pre> and </pre> by hand so that it still looks like computer code - but I lose both the correct indentation and the colors.

What I end up having to do is reload the text into a WYSIWYG HTML editor, and put the colors back by hand - a tedious and error-prone process, but it's necessary if what you want to end up with is syntax-colored C++ on a web page.

But in all likelihood, TextPad already has enough knowledge in its color parser to do this for me, and I can't imagine it would add much to the footprint just to add something like <SPAN CLASS="keyword2">...</SPAN> around each piece of colored text (say, in a highlighted region or something). So this is a feature which would come "nearly free", as in, it wouldn't take much development time to implement it. (At least, I assume that's the case).

Jill
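
To make the request above concrete, here is a minimal sketch in Python of wrapping highlighted tokens in SPAN elements. It is not how TextPad works internally. The `KEYWORDS` set, the `highlight` function and the default class name `keyword2` (borrowed from the post) are all assumptions for illustration, and a real highlighter would need a proper lexer so that keywords inside comments and string literals are left alone.

```python
import html
import re

# Hypothetical, deliberately tiny keyword list (an assumption for this sketch).
KEYWORDS = {"int", "return", "if", "else", "for", "while", "class"}

def highlight(source: str, css_class: str = "keyword2") -> str:
    """Wrap known keywords in <span> tags and escape everything for HTML."""
    parts = []
    # Split into word-like tokens and the text between them, keeping both.
    for token in re.split(r"(\w+)", source):
        escaped = html.escape(token)
        if token in KEYWORDS:
            parts.append(f'<span class="{css_class}">{escaped}</span>')
        else:
            parts.append(escaped)
    # <pre> preserves the original indentation and line breaks.
    return "<pre>" + "".join(parts) + "</pre>"

snippet = highlight("int main() {\n    return a < b;\n}")
print(snippet)
```

The tokens are escaped after splitting, so characters such as `<` in the source become `&lt;` without disturbing the inserted tags.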

ben_josephs · Post by **ben_josephs** » Mon Nov 17, 2003 12:09 pm

Select source.
Edit | Copy Other | As a HTML Page
Paste into destination.

Comments to Helios:

1 (trivial). Shouldn't that be "As an HTML Page"?

2 (non-trivial). The character set specified in the output HTML in the meta tag attribute is "iso-8859-1". This is wrong. The character set used by TextPad is a Microsoft invention called CP1252 or WinLatin1. This corresponds with ISO 8859-1 (Latin 1) for the characters 0x00..0x7F and 0xA0..0xFF, but not for the characters 0x80..0x9F, which, in ISO 8859-1, are control codes. Thus, for example, the euro sign "€" (0x80 in WinLatin1) is misrepresented as a control code. The correct charset, as specified by IANA, is "windows-1252" (http://www.iana.org/assignments/character-sets).
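
The difference described in comment 2 can be checked with a short Python sketch. It uses Python's built-in `cp1252` and `iso-8859-1` codecs as stand-ins for the two character sets; the variable names are illustrative only, and nothing here reflects TextPad's own code.

```python
# Byte 0x80 means different things in the two encodings.
raw = bytes([0x80])

as_cp1252 = raw.decode("cp1252")      # Windows-1252: the euro sign, U+20AC
as_latin1 = raw.decode("iso-8859-1")  # ISO 8859-1: a C1 control code, U+0080

print(hex(ord(as_cp1252)))  # 0x20ac
print(hex(ord(as_latin1)))  # 0x80

# Outside 0x80..0x9F the two encodings agree byte for byte.
shared = bytes(range(0x00, 0x80)) + bytes(range(0xA0, 0x100))
agree = shared.decode("cp1252") == shared.decode("iso-8859-1")
print(agree)  # True
```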

maniac · Post by **maniac** » Mon Nov 17, 2003 6:49 pm

As far as I've seen, the iso-8859-1 character encoding is pretty much standard and used in like 99% of all webpages, since it is the default for pretty much any program that outputs some form of HTML, and using windows-1252 seems a little more... proprietary. Since HTML is universal, shouldn't everything reflect that?

ben_josephs · Post by **ben_josephs** » Mon Nov 17, 2003 7:16 pm

If the text of a web page does not contain any of the following characters (which may not display correctly if you're not viewing this on a Windows platform), there is no problem.

€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ

If the text does include any of them, then the standard requires that you do not use iso-8859-1. These characters do not exist in iso-8859-1. The fact that windows-1252 is proprietary is exactly the point: these characters are a Microsoft extension to iso-8859-1.

You may care to compare
http://www.microsoft.com/globaldev/refe ... /28591.htm
with
http://www.microsoft.com/globaldev/refe ... s/1252.htm
.

You can, of course, use UTF-8 or some other encoding of Unicode if you need these or any other characters that aren't in iso-8859-1. There are a lot of them. Note: in UTF-8, us-ascii characters (0x00..0x7F) take up a single byte; others take up more.
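
A small Python sketch of the two points above: the euro sign cannot be encoded in iso-8859-1 at all, and UTF-8 uses one byte for us-ascii characters and more for others. The helper name `fits_latin1` and the sample characters are assumptions chosen for illustration.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

# "A" (us-ascii), e-acute (Latin-1 but not ASCII), euro sign (neither).
samples = ["A", "\u00e9", "\u20ac"]

# Number of bytes each character occupies in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_lengths[ch]} byte(s) in UTF-8, "
          f"in Latin-1: {fits_latin1(ch)}")
```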

ramonsky · Post by **ramonsky** » Thu Nov 20, 2003 7:52 am

I do a lot of work with Unicode. The fact is, the meta-tag should NOT be ISO-8859-1 if codepoints U+0080 to U+009F are being used for Microsoft extension characters.

TextPad dudes, excellent though your product is, you now have three choices here: Either (a) declare the document to be WINDOWS-1252, or (b) convert to Unicode. In case (b) you can then choose between encoding the document in UTF-8 and declaring it to be in UTF-8, or encoding non-Latin-1 characters as HTML entities (e.g. &euro; for the euro sign).

Please note - this is not a feature enhancement, IT IS A BUG FIX.

Jill
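
The last option in the post above, keeping the page in iso-8859-1 and writing everything else as HTML entities, can be sketched in Python with the standard `xmlcharrefreplace` error handler. Note that it emits numeric character references rather than named entities. The function name `to_latin1_with_entities` is hypothetical.

```python
def to_latin1_with_entities(text: str) -> bytes:
    """Encode as ISO 8859-1; characters outside it become &#NNNN; references."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")

# The euro sign is not in Latin-1, the e-acute is.
encoded = to_latin1_with_entities("Price: 5 \u20ac, caf\u00e9")
print(encoded)  # b'Price: 5 &#8364;, caf\xe9'
```

Output produced this way is valid under an iso-8859-1 declaration, because no byte in the 0x80..0x9F range is ever written for the Microsoft extension characters.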

ben_josephs · Post by **ben_josephs** » Thu Nov 20, 2003 9:43 am

Indeed.

... or UTF-16BE or UTF-16LE (or even UTF-32BE or UTF-32LE).

ramonsky · Post by **ramonsky** » Fri Nov 21, 2003 12:51 pm

I've changed my mind about this. For reasons outlined in my other thread (Unicode Conformance), I don't think all this talk of conversion to UTF-8, etc., is appropriate HERE.

The reason is simple. The ENCODING of a document is something which is not known (and cannot be known) until Save (or Save As) time. It is certainly not known at copy or paste time.

Therefore, there is only one CORRECT way you can fix this bug - which is to remove the META tag altogether. A META tag is not a requirement of HTML, and it is better to not specify one than to specify one incorrectly.

It is the user's choice (at Save As time) to decide upon an encoding. And it is the user's responsibility to ensure that IF they add a META tag, it is correct.

Jill