Unicode Conformance
Posted: Fri Nov 21, 2003 12:41 pm
We currently have a situation whereby certain documents - and I mean plain text documents, nothing fancy - can be successfully edited in Notepad but not in TextPad. A single math character in a document can cause this. I humbly suggest that this is not an ideal situation in which to find ourselves.
Unicode "conformance" is not the same thing as full support of Unicode, which I describe in a separate thread. In fact, Unicode "conformance" doesn't mean very much at all - an application can truthfully claim to be Unicode conformant even if it only supports one character! (Yes, really). But even though being conformant isn't really the same thing as being useful, the fact is, it does mean something. It's a basic, minimum requirement for anything claiming to support Unicode, and a first step toward full support, which may be added later.
Conformance means basically two things: (1) that you never replace a valid Unicode character with a different one just because it's not in your "supported subset", and (2) that you never declare an encoding incorrectly. Sadly, TextPad currently fails on both counts.
Let's tackle (1) first. The requirement essentially says that if you load a text file into TextPad, do some editing, and then re-save the file, TextPad should not corrupt any of the characters in the file. The only characters modified should be the ones you edited. It is *NOT* conformant, for instance, to replace characters you don't recognise with '?' or some other "unknown character" glyph.
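The round-trip requirement can be sketched in a few lines. This is an illustration, not TextPad's actual code; the point is simply that loading and re-saving a document you didn't edit must give back identical bytes, and replacing unknown characters with '?' would fail the check.

```python
# A minimal sketch of the round-trip requirement: load, "edit" nothing,
# save, and verify that every character survives unchanged.

def load(data: bytes) -> str:
    # Decode into an internal form that can hold ANY Unicode character,
    # not just those in one code page.
    return data.decode("utf-8")

def save(text: str) -> bytes:
    return text.encode("utf-8")

# Contains math characters outside code page 1252, e.g. the square root sign.
original = "x\u00b2 + y\u00b2 = z\u00b2, \u221a2 \u2248 1.414"
data = save(original)

# Re-saving an unedited document must be byte-identical; substituting a
# "system default character" for \u221a would break both assertions.
assert load(data) == original
assert save(load(data)) == data
```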
What TextPad currently does is HIGHLY non-conformant. The dialog which pops up when you try to load a document which TextPad can't handle says: "WARNING: <filename> contains characters that do not exist in code page 1252 (ANSI - Latin I). They will be converted to the system default character, if you click OK." This is bad. Of course it is perfectly reasonable to DISPLAY unknown characters using an unknown character glyph, but it is not okay to replace them. They should be preserved. It should even be possible to cut and paste them from one part of a document to another, or across documents, or to and from other applications, without corrupting any Unicode character.
How is it possible to do this in TextPad? Well - although I haven't seen TextPad's source code, I surmise that each character is stored internally using eight bits. Problem is, Unicode characters currently need 21 bits, so realistically, you need to be storing all characters internally, both in open documents and in the clipboard, as 32-bit wide words. Now one objection to doing this is that it would quadruple the memory requirements of stored text. I argue that this is irrelevant. These days, computers have RAM aplenty. Computers are sold to people who demand flashy graphics, animations, music and so on. In reality, while quadrupling the size of data would be a serious problem for a DVD video, it's simply no big deal for a text file. How many text files do you encounter that won't fit on a floppy disk, for example? I humbly suggest that this is not a problem.
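To make the 32-bit suggestion concrete, here is a sketch (in Python, purely for illustration) of a fixed-width buffer: one 32-bit word per character, so every Unicode code point fits and character N is a constant-time array lookup.

```python
# Sketch: storing each character as a fixed-width 32-bit word.
# Indexing by character position is then a plain array access.
import array

# ASCII, an accented letter, a math sign, and a character above U+FFFF.
text = "a\u00e9\u221a\U0001d49c"

buf = array.array("I", (ord(c) for c in text))  # one word per character

assert buf.itemsize >= 4             # 32-bit storage on this platform
assert max(buf) <= 0x10FFFF          # every code point fits in 21 bits
assert chr(buf[3]) == "\U0001d49c"   # direct indexing, no decoding step
```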
An alternative approach that TextPad might take is to store things internally in UTF-8 or UTF-16. This would certainly reduce the storage requirements, but it's not an approach I would recommend. There are advantages to being able to "see", without further decoding, the character you're dealing with. Many of the algorithms I mention in my other thread (What Unicode Conformance Isn't) will be easier to implement (and faster) if you're not constantly encoding and decoding all the time.
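The cost of a variable-width internal format can be seen in a small sketch. With UTF-8 you cannot jump straight to the Nth character; you have to scan the bytes from the start, skipping continuation bytes. (The function below is illustrative, not from any real editor.)

```python
# Sketch of why variable-width encodings complicate editing: finding the
# Nth character in UTF-8 means walking the bytes -- O(n), not O(1).

def utf8_char_at(data: bytes, n: int) -> str:
    count = -1
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:        # lead byte: starts a new character
            count += 1
        if count == n:
            j = i + 1               # gather this character's bytes
            while j < len(data) and data[j] & 0xC0 == 0x80:
                j += 1
            return data[i:j].decode("utf-8")
    raise IndexError(n)

data = "a\u00e9\u221ab".encode("utf-8")   # 4 characters, 7 bytes
assert utf8_char_at(data, 2) == "\u221a"  # had to scan past 3 bytes first
```

With a fixed-width buffer the same lookup is a single array index, which is why the algorithms in the other thread come out simpler and faster.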
Now let's tackle (2). The Edit / Copy Other / As a HTML page feature of TextPad, while certainly very cool, gets the encoding wrong. It incorrectly declares the encoding to be ISO-8859-1, when in fact, there should be no encoding declared at all. YES THAT'S CORRECT - I said there should be no encoding at all! The reality is that this feature copies the highlighted text, after conversion to HTML, INTO THE CLIPBOARD. Fact is, the clipboard is perfectly capable of storing all Unicode characters without corruption. The clipboard is already conformant! Only when the user subsequently pastes the clipboard into another document AND THEN saves it to disk does an encoding come into play. The final encoding will not be known until the user clicks on File / Save As. Consequently, the CORRECT thing to do would be to retain all characters in the clipboard as Unicode characters; further retain them as Unicode characters when pasting them into the document; and finally decide the encoding only at Save As time. This suggests that a META tag which declares a fixed encoding at paste-time is absolutely the wrong thing to do. (The change isn't an enhancement, by the way, it's a bug fix).
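The "decide the encoding only at Save As time" idea can be sketched as follows. The function and its parameters are hypothetical, not TextPad's API; the point is that the charset declaration is only meaningful once the user has actually chosen an encoding, and an encoding that can't represent the text should be refused rather than silently corrupted.

```python
# Sketch: text stays as Unicode internally (and on the clipboard);
# a charset declaration is written only when an encoding is chosen.

def save_as_html(text: str, encoding: str) -> bytes:
    meta = ('<meta http-equiv="Content-Type" '
            f'content="text/html; charset={encoding}">')
    # errors="strict": refuse rather than substitute unknown characters.
    body = text.encode(encoding, errors="strict")
    return meta.encode("ascii") + b"\n" + body

snippet = "\u221a2 \u2248 1.414"            # math characters, not in ISO-8859-1

page = save_as_html(snippet, "utf-8")       # fine: UTF-8 covers all of Unicode
assert b"charset=utf-8" in page

try:
    save_as_html(snippet, "iso-8859-1")     # would corrupt -- so it must fail
except UnicodeEncodeError:
    pass
```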
So, why should you vote for Unicode Conformance? From what you've read above, you'd be voting for a possible quadrupling of TextPad's memory requirements, and the only real benefit you'll see is that it won't corrupt non-codepage-1252 characters (but you still won't necessarily be able to see them displayed properly). Well, for now, that's all I'm suggesting, and all I'm asking you to vote for. But in the long term future, other, more sophisticated enhancements may eventually become possible if this one is implemented. For some ideas of what might one day be possible, see my other post (What Unicode Conformance Isn't). There you will find all of the features and algorithms we REALLY want - but simple Unicode conformance has to come first. It does. It really does. You can't leap straight from little or no Unicode features to full blown Unicode support in one minor upgrade. Read that thread, to understand what's involved.
So I'm suggesting that we vote for TextPad to become conformant to the Unicode standard, not necessarily for its immediate benefits, but the potential for other enhancements which it opens up in the future.
The poll is, therefore: Should TextPad become Unicode Conformant?