Unicode Conformance

ramonsky · Post by **ramonsky** » Fri Nov 21, 2003 12:41 pm

We currently have a situation whereby certain documents - and I mean plain text documents, nothing fancy - can be successfully edited in Notepad but not in TextPad. A single math character in a document can cause this. I humbly suggest that this is not an ideal situation in which to find ourselves.

Unicode "conformance" is not the same thing as full support of Unicode, which I describe in a separate thread. In fact, Unicode "conformance" doesn't mean very much at all - an application can truthfully claim to be Unicode conformant even if it only supports one character! (Yes, really). But even though being conformant isn't really the same thing as being useful, the fact is, it does mean something. It's a basic, minimum requirement for anything claiming to support Unicode, and a first step toward full support, which may be added later.

Conformance means basically two things: (1) that you never replace a valid Unicode character with a different one just because it's not in your "supported subset", and (2) that you never declare an encoding incorrectly. Sadly, TextPad currently fails on both counts.

Let's tackle (1) first. The requirement essentially says that if you load a text file into TextPad, do some editing, and then re-save the file, TextPad should not corrupt any of the characters in the file. The only characters modified should be the ones you edited. It is *NOT* conformant, for instance, to replace characters you don't recognise with '?' or some other "unknown character" glyph.

What TextPad currently does is HIGHLY non-conformant. The dialog which pops up when you try to load a document which TextPad can't handle says: "WARNING: <filename> contains characters that do not exist in code page 1252 (ANSI - Latin I). They will be converted to the system default character, if you click OK." This is bad. Of course it is perfectly reasonable to DISPLAY unknown characters using an unknown character glyph, but it is not okay to replace them. They should be preserved. It should even be possible to cut and paste them from one part of a document to another, or across documents, or to and from other applications, without corrupting any Unicode character.
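To make the difference concrete, here is a minimal Python sketch. It is not TextPad's code; the sample text and function names are invented for illustration. It contrasts forcing text through code page 1252 (substituting a default character) with keeping it as Unicode throughout.

```python
# Hypothetical illustration only: sample text and function names are assumptions.
original = "area = \u222b f(x) dx"  # U+222B INTEGRAL does not exist in code page 1252


def lossy_round_trip(text: str) -> str:
    """Force the text through cp1252, substituting '?' for unsupported characters."""
    return text.encode("cp1252", errors="replace").decode("cp1252")


def preserving_round_trip(text: str) -> str:
    """Keep the text as Unicode and serialize with an encoding that covers it."""
    return text.encode("utf-8").decode("utf-8")


damaged = lossy_round_trip(original)      # the integral sign is gone for good
intact = preserving_round_trip(original)  # identical to the original
```

The first function is the non-conformant behaviour described above: once the '?' has been written, the original character cannot be recovered. The second leaves every character untouched, whether or not the editor can display it.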

How is it possible to do this in TextPad? Well - although I haven't seen TextPad's source code, I surmise that each character is stored internally using eight bits. Problem is, Unicode characters currently need 21 bits, so realistically, you need to be storing all characters internally, both in open documents and in the clipboard, as 32-bit wide words. Now one objection to doing this is that it would quadruple the memory requirements of stored text. I argue that this is irrelevant. These days, computers have RAM aplenty. Computers are sold to people who demand flashy graphics, animations, music and so on. In reality, while quadrupling the size of data would be a serious problem for a DVD video, it's simply no big deal for a text file. How many text files do you encounter that won't fit on a floppy disk, for example? I humbly suggest that this is not a problem.

An alternative approach that TextPad might take is to store things internally in UTF-8 or UTF-16. This would certainly reduce the storage requirements, but it's not an approach I would recommend. There are advantages to being able to "see", without further decoding, the character you're dealing with. Many of the algorithms I mention in my other thread (What Unicode Conformance Isn't) will be easier to implement (and faster) if you're not constantly encoding and decoding.
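A rough Python sketch of the trade-off, under the assumption of a short sample string chosen for illustration: it compares the byte size of the same text in UTF-8, UTF-16 and UTF-32, and shows how a fixed 32-bit width lets you read the n-th character by a simple offset.

```python
# Sample text is an assumption: "x ∈ ℝ, 𝔸" (the last character lies outside the BMP).
sample = "x \u2208 \u211d, \U0001d538"

utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")  # "-le" variants avoid a byte order mark
utf32 = sample.encode("utf-32-le")

sizes = {"utf-8": len(utf8), "utf-16": len(utf16), "utf-32": len(utf32)}


def code_point_at(buf: bytes, index: int) -> int:
    """Read code point number `index` from a UTF-32-LE buffer by direct offset."""
    return int.from_bytes(buf[4 * index : 4 * index + 4], "little")


last = code_point_at(utf32, 7)  # no scanning or decoding of earlier characters
```

For this sample the 32-bit form is roughly twice the size of the UTF-8 form; for pure ASCII text it would be four times the size, which is the "quadrupling" discussed above.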

Now let's tackle (2). The Edit / Copy Other / As a HTML page feature of TextPad, while certainly very cool, gets the encoding wrong. It incorrectly declares the encoding to be ISO-8859-1, when in fact, there should be no encoding declared at all. YES THAT'S CORRECT - I said there should be no encoding at all! The reality is that this feature copies the highlighted text, after conversion to HTML, INTO THE CLIPBOARD. Fact is, the clipboard is perfectly capable of storing all Unicode characters without corruption. The clipboard is already conformant! Only when the user subsequently pastes the clipboard into another document AND THEN saves it to disk does an encoding come into play. The final encoding will not be known until the user clicks on File / Save As. Consequently, the CORRECT thing to do would be to retain all characters in the clipboard as Unicode characters; further retain them as Unicode characters when pasting them into the document; and finally decide the encoding only at Save As time. This suggests that a META tag which declares a fixed encoding at paste-time is absolutely the wrong thing to do. (The change isn't an enhancement, by the way; it's a bug fix.)
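One way to picture "decide the encoding only at Save As time" is the following Python sketch. The function names are hypothetical, and the use of numeric character references for characters the chosen encoding cannot represent is an assumption of this sketch, not something TextPad is known to do.

```python
import html


def html_fragment(text: str) -> str:
    """Convert text to an HTML fragment; it stays Unicode and declares no charset."""
    return "<pre>" + html.escape(text) + "</pre>"


def save_html(fragment: str, encoding: str) -> bytes:
    """Declare the encoding only at save time, when it is finally known."""
    meta = f'<meta http-equiv="Content-Type" content="text/html; charset={encoding}">'
    document = f"<html><head>{meta}</head><body>{fragment}</body></html>"
    # Assumption: unrepresentable characters become numeric references, not '?'.
    return document.encode(encoding, errors="xmlcharrefreplace")


clip = html_fragment("a < \u222b")          # what would sit in the clipboard
as_latin1 = save_html(clip, "iso-8859-1")   # integral sign written as &#8747;
as_utf8 = save_html(clip, "utf-8")          # integral sign written directly
```

The clipboard content carries no encoding declaration at all; the META tag appears only in the bytes that are written to disk, and it always matches the encoding actually used.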

So, why should you vote for Unicode Conformance? From what you've read above, you'd be voting for a possible quadrupling of TextPad's memory requirements, and the only real benefit you'll see is that it won't corrupt non-codepage-1252 characters (but you still won't necessarily be able to see them displayed properly). Well, for now, that's all I'm suggesting, and all I'm asking you to vote for. But in the long term future, other, more sophisticated enhancements may eventually become possible if this one is implemented. For some ideas of what might one day be possible, see my other post (What Unicode Conformance Isn't). There you will find all of the features and algorithms we REALLY want - but simple Unicode conformance has to come first. It does. It really does. You can't leap straight from little or no Unicode features to full blown Unicode support in one minor upgrade. Read that thread, to understand what's involved.

So I'm suggesting that we vote for TextPad to become conformant to the Unicode standard, not necessarily for its immediate benefits, but the potential for other enhancements which it opens up in the future.

The poll is, therefore: Should TextPad become Unicode Conformant?

ramonsky · Post by **ramonsky** » Fri Nov 21, 2003 2:25 pm

Discussion of other Unicode possibilities is here. (These are not what you're voting for in this poll though).

seaktf · Post by **seaktf** » Wed Sep 22, 2004 10:35 am

Yup, I totally agree with you

ramonsky wrote: ...... Problem is, Unicode characters currently need 21 bits, so realistically, you need to be storing all characters internally, both in open documents and in the clipboard, as 32-bit wide words.

I suppose that's a typo. You meant 32 bits instead of 21 bits, right?

ben_josephs · Post by **ben_josephs** » Wed Sep 22, 2004 10:51 am

No. Ramonsky is right. It's 21 bits.

Post by **MudGuard** » Wed Sep 22, 2004 10:56 am

No, the 21 is correct.

Currently, the highest possible value of a Unicode character is
HEX 10FFFD which is
decimal 1 114 109 and
binary 1 0000 1111 1111 1111 1101

If you count the number of bits, you will get the result 21.

The 32 bits come from the fact that
- bits are usually organized in 8-bit Bytes
- processors use bytes usually in groups of size 2 to the power of n (i.e. 1, 2, 4, 8, 16)
The smallest such byte-group which can hold 21 bits is the one with 4 bytes of 8 bits, i.e. 32 bit.
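The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting argument; the helper name is made up for illustration.

```python
highest = 0x10FFFD                   # the value quoted above
bits_needed = highest.bit_length()   # number of bits in its binary form


def smallest_power_of_two_bytes(bits: int) -> int:
    """Smallest group of 1, 2, 4, 8, ... bytes that can hold `bits` bits."""
    size = 1
    while size * 8 < bits:
        size *= 2
    return size


bytes_needed = smallest_power_of_two_bytes(bits_needed)
```

Two bytes give only 16 bits, which is too few, so the next size up, four bytes, is the one that fits.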

seaktf · Post by **seaktf** » Wed Sep 22, 2004 4:29 pm

OK, I've just spotted the word *current* in both of your messages.

I see what you meant. Currently, the *defined* range in Unicode 4 is just U+0000 to U+10FFFD, which is a little over 1.1 million code points (Private Use Areas included; see http://www.unicode.org/charts/ ), but that doesn't mean the Unicode Consortium isn't going to extend it in the future. Who knows, maybe tomorrow we'll make contact with ET and millions of new characters will be added.

Post by **MudGuard** » Wed Sep 22, 2004 5:21 pm

You assume that ET can write ... :lol: :lol: :lol: :lol: :lol:

JAB Creations · Post by **JAB Creations** » Tue Feb 21, 2006 3:12 am

This is definitely something I would like to see supported! I have language translations for my site and I have 26 various languages all working simultaneously in notepad but not Textpad? Textpad either supports Unicode or not, and it currently does not.

dburry · Post by **dburry** » Wed Mar 22, 2006 3:03 am

I need this very badly.... many years of being a staunch textpad supporter, the world is passing you by, everything's becoming internationalized, even the products where I work... I keep having to jump ship and use a vastly inferior editor just to edit my files... and it's been this way for HOW MANY YEEEAAAARS??? come on.....

Fredrc · Post by **Fredrc** » Fri Mar 23, 2007 12:05 am

This is depressing, I love Textpad. I NEeeeeed Unicode support. Arrghhh.

Sorry for venting. I was so excited to see an update that I installed it, but my pages still can't be opened without "simple" characters getting trashed. I have been using Notepad2 for these txt files when I need a workaround. But I want a full-time editor. Why oh why wasn't this problem solved in version 5?

Now I really have to start shopping, and learning new command keys. Arghhhh.

devdanke · Post by **devdanke** » Mon Apr 16, 2007 10:05 am

Let's hope Helios rewards its loyal customers with better Unicode support in an upcoming TextPad 5.x release. I've got my fingers crossed.

Drxenos · Post by **Drxenos** » Mon Apr 16, 2007 11:12 am

A while back, someone posted a good link on Unicode. I cannot find it. Can someone point me to it?

DrX

ben_josephs · Post by **ben_josephs** » Mon Apr 16, 2007 12:00 pm

Google is your friend. Wikipedia is a good place to start: http://en.wikipedia.org/wiki/Unicode. It includes a link to the home of Unicode: http://unicode.org/.

Drxenos · Post by **Drxenos** » Mon Apr 16, 2007 2:55 pm

ben_josephs wrote: Google is your friend. Wikipedia is a good place to start: http://en.wikipedia.org/wiki/Unicode. It includes a link to the home of Unicode: http://unicode.org/.

Um, yes, I know that. Someone posted a link here to a nice summary of Unicode. I remember it had some great info on common misconceptions.

DrX

jporter · Post by **jporter** » Wed Apr 25, 2007 2:07 pm

I work at special services at SDL Netherlands and we do a lot of localised work in XML files which are loaded in our system and exported.

We have bought a few TextPad licenses and the people here love it; they set up macros and use it every day. Now this is all great until Unicode XML files are used. Then people have to start up Notepad, and we have to watch out that we don't open and save the translated files with TextPad, because it "breaks" them as there is no Unicode support.

I just said to my colleague, "look, TextPad 5 is out", and he got all excited because maybe there was Unicode support, haha. When he saw that there was again no Unicode support he was really disappointed, as am I.

Supporting Unicode would really get more users to TextPad and would make the package more complete. For now I suggest you stop making UI changes and get that Unicode support in!

We vote yes.
