
What Unicode Conformance Isn't

Posted: Fri Nov 21, 2003 12:46 pm
by ramonsky
In this thread I am going to list a number of algorithms and possible TextPad enhancements which go way beyond mere Unicode "Conformance". These are the kinds of things that people expect from Unicode. The ultimate promise, as it were, sadly, rarely delivered. I mention them here because I want people to understand that adding full Unicode support is hard, and nobody should expect it overnight. But if you want to see any or all of these features, you must vote for Unicode Conformance FIRST, because without that, none of this can ever happen.

(1) Case Conversion

From the Edit menu, you can click on Change Case, then you have five different options. But none of these case-change algorithms are true Unicode algorithms. For example, in Unicode, if you highlight the characters "aßa" and click on Edit / Change Case / Upper Case, they should change to "ASSA", because the single character 'ß' changes to the two characters 'SS' when uppercased. A true Unicode application should apply the Unicode ToUppercase algorithm (also ToLowercase and ToTitlecase), as documented in the Unicode Standard.
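To make the 'ß' example concrete, here is a small sketch in Python, chosen only because its built-in str.upper() applies the full Unicode case mappings. It says nothing about how TextPad works internally, and naive_upper is a made-up helper that imitates a one-character-in, one-character-out conversion.

```python
def naive_upper(text: str) -> str:
    """Hypothetical one-to-one uppercasing: a character whose uppercase
    form is longer than one character is left unchanged."""
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


sample = "a\u00dfa"          # "aßa"
full = sample.upper()        # full Unicode mapping: ß expands to SS
naive = naive_upper(sample)  # one-to-one mapping cannot expand ß

print(full, naive)
```

Note that the full mapping changes the length of the string, which is exactly what a fixed-width, character-by-character conversion cannot cope with.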

(2) Case-insensitive Comparison

Similarly, if you click on Search / Find (to get the find dialog) and leave the Match Case checkbox unchecked, we should ideally be doing a Unicode case-insensitive comparison. This means that a search for "assa" should match with "aßa". To achieve this, you need to apply the Unicode Casefolding algorithm. You also need to do some optimisation, because the Unicode Casefold Algorithm is presented in very non-optimal terms. (If you want to compare "ANT" with "elephant", for example, the casefolding algorithm says you should casefold both words, to give "ant" and "elephant", and then compare them, but an optimised implementation only needs to convert the first character!).
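A minimal sketch of both ideas, again in Python and assuming its str.casefold() as the folding step. The function names are invented for illustration, and a complete caseless match would also have to deal with normalisation, which this ignores.

```python
from itertools import zip_longest


def contains_caseless(haystack: str, needle: str) -> bool:
    """Case-insensitive 'find': fold both sides, then search."""
    return needle.casefold() in haystack.casefold()


def equal_caseless(a: str, b: str) -> bool:
    """Fold lazily, one character at a time, and stop at the first
    difference instead of folding both strings up front."""
    folded_a = (c for ch in a for c in ch.casefold())
    folded_b = (c for ch in b for c in ch.casefold())
    return all(x == y for x, y in zip_longest(folded_a, folded_b))


print(contains_caseless("a\u00dfa", "assa"))
print(equal_caseless("ANT", "elephant"))
```

With equal_caseless, comparing "ANT" with "elephant" folds only the first character of each word before giving up, which is the optimisation described above.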

(3) Right-to-left Text

Not all text in the world flows from left to right. Some Unicode characters are expected to flow from right to left by default. What's more, you can embed right-to-left text inside left-to-right text, and vice versa. Some characters (for example open and close brackets) are expected to be mirror-reversed when rendered from right-to-left. The whole thing is defined by the Unicode Bidirectional Algorithm. This would be additional functionality for TextPad. In addition to acknowledging the default directionality of Unicode characters, it must also be possible for users to select a default direction for a document, or for a selected region.
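The per-character data behind this is easy to inspect. The sketch below uses Python's unicodedata module to look up directionality and mirroring; default_direction is a hypothetical helper and only a crude "first strong character wins" simplification, nowhere near the full Bidirectional Algorithm.

```python
import unicodedata


def default_direction(text: str) -> str:
    """Guess a paragraph direction from the first strongly directional
    character (a simplification, not the complete algorithm)."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "ltr"  # assumed fallback when no strong character is found


hebrew = "\u05e9\u05dc\u05d5\u05dd"   # a Hebrew word, written with escapes
print(default_direction(hebrew + " world"))
print(default_direction("hello " + hebrew))
print(unicodedata.mirrored("("), unicodedata.mirrored("a"))
```

A user-selected direction for a document or region would simply override whatever default_direction guesses.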

(4) Sorting

Highlight some text, click on Tools / Sort. TextPad currently only offers you a lexicographical sort, which means (for example) that the following sequence is considered to be correctly sorted: 'E', 'F', 'e', 'f', 'É', 'é'. In fact, TextPad is sorting on the basis of the character code of each character, not on the basis of local collation criteria. It gets even worse if you ask for a case-insensitive sort. If you do this, our example ends up as: 'e', 'f', 'é'. This is not good. The correct way to do this is by using the Unicode Collation Algorithm (UCA). (This was only published literally a few weeks ago, so I don't think ANY application fully supports this yet.) To implement this in TextPad, the Sort dialog would need something of an overhaul. You would need to be able to specify things like whether or not accents are considered significant, whether uppercase sorts before or after lowercase, and so on. (Remember, Unicode sorting has NOTHING to do with codepoint order.) All of this is documented in the UCA, and appropriate defaults could easily be set for each locale (for instance the defaults for French would be different from the defaults for English).
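To show the kind of options such a dialog would feed into, here is a deliberately simplified sketch. It is NOT the Unicode Collation Algorithm, just a three-level sort key (base letters, then accents, then case) built from Python's unicodedata module; the function name and its two parameters are invented for this example.

```python
import unicodedata


def simple_collation_key(text, accents_significant=True, upper_first=False):
    """Toy multi-level key: letters first, accents second, case third."""
    decomposed = unicodedata.normalize("NFD", text)
    base = [ch for ch in decomposed if not unicodedata.combining(ch)]
    marks = [ch for ch in decomposed if unicodedata.combining(ch)]
    primary = "".join(base).casefold()
    secondary = "".join(marks) if accents_significant else ""
    # False sorts before True, so flip the flag when uppercase goes first.
    tertiary = tuple(ch.isupper() ^ upper_first for ch in base)
    return (primary, secondary, tertiary)


letters = ["E", "F", "e", "f", "\u00c9", "\u00e9"]   # É and é as escapes

by_code_point = sorted(letters)
by_key = sorted(letters, key=simple_collation_key)
upper_first = sorted(
    letters, key=lambda s: simple_collation_key(s, upper_first=True)
)

print(by_code_point)
print(by_key)
print(upper_first)
```

Sorting by code point leaves the list exactly as listed above, while the toy key groups every kind of 'e' before any 'f'. A real implementation would take its weights from the UCA tables and the locale rather than from these ad hoc rules.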

(5) Regular Expressions

Regular expressions take on a whole new dimension in Unicode. Expressions like "[[:alpha:]]" must now match Greek and Chinese letters as well as just Latin ones. Word characters ("\w" and "\W"), and the word boundaries derived from them, are harder to calculate. You must be able to use the "\uHHHH" and "\U00HHHHHH" expressions in regular expressions. Unicode also has EXTRA features which you must be able to specify in regular expressions - if you want to match only against currency symbols you could specify "\p{Sc}"; if you want to match only against characters in the Greek script you could specify "\p{Greek}", and so on. All of this is documented in Unicode Technical Report #18. This would not be a minor enhancement, it would be a PHENOMENAL enhancement.
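For a feel of what this means in practice, the sketch below uses Python's standard re module, whose "\w" is already Unicode-aware for text patterns and which accepts "\uHHHH" escapes. As far as I know it has no "\p{...}" support, so the currency-symbol match is approximated here by filtering on the general category from unicodedata; that substitution is mine, not something UTR #18 prescribes.

```python
import re
import unicodedata

# "café", three Greek letters and two Chinese characters, as escapes.
words_text = "caf\u00e9 \u03b1\u03b2\u03b3 \u4e2d\u6587"
words = re.findall(r"\w+", words_text)

# A \uHHHH escape inside the pattern itself (here the euro sign).
euro = re.search(r"\u20ac", "price: \u20ac5")

# Stand-in for \p{Sc}: keep characters whose general category is Sc.
prices = "\u20ac5 or $6 or \u00a37"
currency = [ch for ch in prices if unicodedata.category(ch) == "Sc"]

print(words)
print(euro is not None)
print(currency)
```

An engine with real property support would let you write the last step as a single pattern instead of a loop.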

(6) Normalisation

In Unicode, there is a distinction between "characters" and "glyphs". A glyph is what gets displayed; a character is what gets stored internally. Confusingly, there is more than one way of "spelling" a glyph. For example, the glyph 'é' can be spelt as either the single character U+00E9 (e acute), or as the two-character-sequence U+0065 (e) followed by U+0301 (acute). If, within a given selection, all of your glyphs are stored CONSISTENTLY then the text is said to be "normalised". If everything is in its shortest possible form, it's called "Normalisation Form C" (NFC). If everything is in its longest possible form, it's called "Normalisation Form D" (NFD). NFC uses less space, but NFD makes most of the algorithms listed above go a lot faster.
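The two spellings of 'é' can be checked directly. This is a short sketch using Python's unicodedata.normalize; the variable names are just for illustration.

```python
import unicodedata

composed = "\u00e9"      # one character: e acute
decomposed = "e\u0301"   # two characters: e, then combining acute

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(composed == decomposed)   # the raw strings differ
print(nfc == composed, len(nfc))
print(nfd == decomposed, len(nfd))
```

Both strings render as the same glyph, yet a plain string comparison treats them as different until one of them is normalised.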

The implications of this for a text editor are astounding. Text documents may legitimately contain a mixture of NFC, NFD and non-normalised text, and TextPad should not modify any of it without the explicit consent of the user. But on the Edit menu there should be the possibility of converting selected text (or the whole document) to either NFC or NFD. Then you have other decisions to make, such as, if you press the cursor-right button to step over 'é' spelt U+0065 U+0301, should it be possible to position the cursor between the e and the acute? Should it be possible to delete the acute accent without deleting the e? A common sense approach would be to say no, to treat every glyph as indivisible. And as for inserting new characters by typing them, should they be entered into a document in NFC or NFD? This should be an editable preference setting.
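The "treat every glyph as indivisible" approach could be sketched like this. It assumes a glyph is simply a base character plus any combining marks that follow it, which is cruder than the real Unicode text segmentation rules; clusters and cursor_stops are hypothetical names.

```python
import unicodedata


def clusters(text: str) -> list:
    """Group each base character with the combining marks after it."""
    result = []
    for ch in text:
        if result and unicodedata.combining(ch):
            result[-1] += ch
        else:
            result.append(ch)
    return result


def cursor_stops(text: str) -> list:
    """Character offsets at which the cursor is allowed to rest."""
    stops = [0]
    for cluster in clusters(text):
        stops.append(stops[-1] + len(cluster))
    return stops


word = "cafe\u0301s"   # "cafés" with é spelt as e + combining acute
print(clusters(word))
print(cursor_stops(word))
```

Offset 4, between the e and the acute, never appears in the list of stops, so cursor-right steps over the whole glyph and a delete would remove both characters together.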

It gets worse. Suppose you do a search for 'é'. Should it match U+00E9? Or U+0065 U+0301? Or both? In most normal circumstances, you'd want it to match either, but there may be special circumstances in which you'd like to search for a particular spelling of a glyph, so there would have to be a(nother) checkbox on the Search dialog to accommodate this.
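One possible shape for that checkbox, as a sketch only: normalise both the document and the search term to NFD unless the user asks for an exact spelling. The function and parameter names are invented, and this naive substring test does not stop a search for plain 'e' from matching the first half of a decomposed 'é'.

```python
import unicodedata


def find_glyph(haystack: str, needle: str, exact_spelling: bool = False) -> bool:
    """Search that ignores how a glyph is spelt, unless told otherwise."""
    if exact_spelling:
        return needle in haystack
    nfd_haystack = unicodedata.normalize("NFD", haystack)
    nfd_needle = unicodedata.normalize("NFD", needle)
    return nfd_needle in nfd_haystack


document = "caf\u00e9"    # é stored as the single character U+00E9
query = "e\u0301"         # é typed as U+0065 U+0301

print(find_glyph(document, query))
print(find_glyph(document, query, exact_spelling=True))
```

The default matches either spelling, and the exact_spelling flag covers the special circumstances in which only one particular spelling is wanted.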

(7) Compatibility Equivalence

There is another form of equivalence, called "compatibility" equivalence. In simple terms, one Unicode character is compatible with another (or with a sequence of others) if they look vaguely the same. (I've oversimplified I know - please don't pull me up on this). So, for example, the Unicode Trademark character is equivalent to "TM".

For this reason, there must be some item on the Edit menu which makes it possible to highlight a section of text and convert it to "compatibility" form. (Strictly speaking, there are two compatibility forms, called NFKC (the shortest encoding) and NFKD (the longest encoding), but it would suffice to let TextPad choose, on the basis of the user's preference between NFC and NFD). For searching, there should be yet another extra checkbox on the Search dialog - ON for a compatibility equivalent match, OFF for a normal match.
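A sketch of how "let TextPad choose" might look, assuming a single user preference between composed and decomposed text. The helper names are hypothetical; the normalisation itself is Python's unicodedata.normalize.

```python
import unicodedata


def to_compatibility(text: str, prefer_composed: bool = True) -> str:
    """Pick NFKC or NFKD from the user's NFC/NFD preference."""
    form = "NFKC" if prefer_composed else "NFKD"
    return unicodedata.normalize(form, text)


def compatibility_match(haystack: str, needle: str) -> bool:
    """Search with the compatibility checkbox switched ON."""
    return to_compatibility(needle) in to_compatibility(haystack)


trademark = "\u2122"   # the single-character trademark sign
print(to_compatibility(trademark))
print(to_compatibility("\u00e9" + trademark, prefer_composed=False))
print(compatibility_match("Acme" + trademark, "TM"))
```

Ordinary NFC leaves the trademark sign alone; only the compatibility forms replace it with the two letters.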

(8) Variable Width Characters - EVEN WITH A MONOSPACED FONT

When using a proportional font, we expect all characters to be different widths, so TAB stops are expressed in terms of pixels, but when using a fixed-width font, we expect all characters to be the same width, hence TAB stops are expressed in terms of character positions. Ok - let's use Unicode terminology here, "glyph positions". We imagine a text document to consist of a rectangular array of cells, with each glyph occupying exactly one cell (apart from TAB, which has to occupy more than one cell in order to do its job). But some Unicode glyphs are just too damn big to fit in a single cell, so you're stuck with a choice between using bigger cells, or letting some characters occupy more than one cell. Well, the latter approach won the day, so some characters, Han ideographs for instance, will occupy two or more successive character cells.

So when you're moving the cursor around the document, the behavior in the presence of wide characters should depend on the existing preference setting: Configure / Preferences / Editor / Constrain the cursor to the text. If constrained, the cursor may never end up inside a wide character. In this sense, they would act a little like TAB (though fixed width).
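Here is a rough sketch of the cell arithmetic, assuming that characters whose East Asian Width property is W or F take two cells, combining marks take none, and everything else takes one. That rule of thumb and the function names are my assumptions, not anything TextPad defines.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Number of monospaced cells a character is assumed to occupy."""
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def display_columns(text: str) -> int:
    return sum(cell_width(ch) for ch in text)


def snap_column(text: str, column: int) -> int:
    """Constrain the cursor: a column inside a wide character snaps
    back to the column where that character starts."""
    start = 0
    for ch in text:
        width = cell_width(ch)
        if start <= column < start + width:
            return start
        start += width
    return start  # past the end of the line: clamp to the end


line = "a\u4e2db"   # 'a', one Han ideograph, 'b'
print(display_columns(line))
print(snap_column(line, 2))
```

The line is three characters long but four cells wide, and column 2 (the second half of the ideograph) is never a legal cursor position when the cursor is constrained to the text.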

(9) Font Selection

Displaying a character is harder in Unicode. In current versions of TextPad, you configure a SINGLE font for each document class. This won't do for Unicode. No single font contains every single glyph. A new approach is needed. Instead, for each document class, you must select a SEQUENCE of fonts. Every time TextPad needs to display a glyph, it will try each font in turn, trying to find an image it can display. Eventually, if it runs out of fonts and still hasn't found the image, it may display an "unsupported character" glyph.

Doing this on a per-document-class basis is sensible, but possibly tedious. Other approaches may have to be considered, such as a per-document-class configuration file (like you have for syntax coloring).
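The "try each font in turn" idea is small enough to sketch. Everything here is hypothetical: the font names are made up, and each font is modelled as nothing more than the set of code points it can draw, whereas a real implementation would ask the font or the operating system.

```python
from typing import Optional, Sequence, Set, Tuple

# (made-up font name, code points that font is assumed to cover)
Font = Tuple[str, Set[int]]


def pick_font(ch: str, fonts: Sequence[Font]) -> Optional[str]:
    """Return the first font in the sequence that covers the character,
    or None, meaning 'draw the unsupported-character glyph'."""
    for name, coverage in fonts:
        if ord(ch) in coverage:
            return name
    return None


document_class_fonts = [
    ("LatinMono", set(range(0x20, 0x7F))),
    ("GreekMono", set(range(0x370, 0x400))),
]

for ch in ["a", "\u03b1", "\u4e2d"]:
    print(repr(ch), pick_font(ch, document_class_fonts))
```

The order of the sequence matters: where two fonts cover the same character, the one listed first wins, which is what makes this configurable per document class.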

(10) Inputting Characters

Microsoft does a reasonable job of Unicode input with its IMEs (Input Method Editors), but the problem with them is that in order to input an arbitrary Unicode character, you'd have to have the right IME installed. For most people, it would be better to have some other, more general method. TextPad already has clipbooks, of course, so it would be a relatively simple matter just to add a few more books to the clip library, containing every Unicode script. In addition, there should be a way of entering an arbitrary Unicode character given its codepoint in hexadecimal.
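Entering a character by its hexadecimal codepoint is the easy part. A minimal sketch follows; the function name, the optional "U+" prefix and the decision to reject surrogates are all my own assumptions about how such a box might behave.

```python
def char_from_hex(code_point_hex: str) -> str:
    """Turn text such as '20AC' or 'U+20AC' into the character itself."""
    text = code_point_hex.strip()
    if text.upper().startswith("U+"):
        text = text[2:]
    value = int(text, 16)          # raises ValueError on non-hex input
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    return chr(value)              # raises ValueError above 0x10FFFF


print(char_from_hex("U+20AC"))
print(char_from_hex("e9"))
```

Anything that is not valid hexadecimal, or that falls outside the Unicode range, is reported as an error rather than silently inserted.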

(11) Private Use Area Characters

This is the bit where you get to write in Klingon. Private Use characters are characters which the Unicode Consortium reserve for private use by consenting parties. So if I decide that a particular PUA character is going to represent (say) a silhouette of the starship Enterprise, and you are happy to go along with that, then I can write documents containing this character, and you can read them. (Nobody else will be able to read them though). In order to use the PUA, I have to provide three things: (1) fonts which allow my characters to be rendered, (2) some sort of input method, for example a clip book, and (3) some sort of table of character properties, so that all of the above algorithms can operate on my characters.

Strange as it may seem, some PUA conventions are already arising. One popular PUA "set" is known as CSUR, and contains, among other things, the Klingon alphabet. So, if you had a "CSUR drop-in" (containing the CSUR fonts, clip book and character properties table), you could type documents in Klingon, send them to other people, and have them correctly understood.
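The "table of character properties" part of a drop-in could be as simple as a lookup consulted before the built-in data. The sketch below is purely illustrative: the table contents, the character name and the helper functions are all invented, and only the general category is overridden.

```python
import unicodedata

# Entirely made-up private agreement: U+E000 is a starship silhouette.
PRIVATE_PROPERTIES = {
    0xE000: {"name": "EXAMPLE STARSHIP SILHOUETTE", "category": "So"},
}


def is_private_use(ch: str) -> bool:
    """Private Use characters carry the general category Co."""
    return unicodedata.category(ch) == "Co"


def category_of(ch: str, private_table: dict) -> str:
    """Use the drop-in table for PUA characters it knows about,
    and the standard character data for everything else."""
    if is_private_use(ch):
        entry = private_table.get(ord(ch))
        return entry["category"] if entry else "Co"
    return unicodedata.category(ch)


print(is_private_use("\ue000"))
print(category_of("\ue000", PRIVATE_PROPERTIES))
print(category_of("\ue001", PRIVATE_PROPERTIES))
```

A PUA character that is not in the table keeps its default category, so documents using an unknown convention are still preserved, just not understood.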

CONCLUSION

If you want any or all of these enhancements in TextPad, the FIRST STEP is Unicode Conformance. That must come first. Full Unicode support is something too big to be rushed. It must be done slowly, carefully, one step at a time. You want this? Or even some of it? Vote for Unicode Conformance. That won't get you everything, but it's where everything starts.

Jill

Posted: Fri Nov 21, 2003 2:26 pm
by ramonsky
poll is here

Posted: Fri Nov 21, 2003 3:31 pm
by maniac
First of all, you need to keep all of this in one thread - I personally am tired of seeing so many threads on the same topic. Second, most of your posts seem somewhat rude.

Personally, I think moving to Unicode would make a shift farther from what I use Textpad for. Textpad is already too good to rewrite it now, and I think taking these actions would first of all cost a large amount of time and money, and possibly break Textpad's usability. I use Textpad for coding, not for editing unicode documents. I'm sure that's just me, but that's what I do.

Posted: Fri Nov 21, 2003 3:57 pm
by ramonsky
I can only apologise for any appearance of rudeness. Rest assured it was not deliberate. Actually, I've looked back at all my posts, and even in retrospect none of it sounds rude to me, but I'll apologise anyway coz I'm nice like that.

As for putting it all in one thread ... it IS all in one thread. This is it. The separate poll thread is about Unicode Conformance. THIS thread is about additional functionality over and above mere conformance. This thread is not being voted on. It seemed reasonable (to me) to restrict the poll thread to what was actually being voted on, and everything else elsewhere. Was that dumb?

I use TextPad for coding too. I think most of us do. But most modern programming languages use Unicode now. C++ has the type wchar_t, and Unicode string literals (for instance L"hello world"). Python also has built-in Unicode strings. Unicode is the default in Java. What's more, many modern programs embrace internationalization, and us programmers need to deal with this. It is even possible that in the future, programming languages which haven't even been invented yet may well use some of the Unicode mathematical symbols. Well, just my thoughts.

(And my apologies again for appearing rude. I'm still confused about how I did that though).

Jill (not being rude)

Posted: Fri Nov 21, 2003 4:11 pm
by ben_josephs
Maniac

You wrote of ramonsky:

> most of your posts seem somewhat rude

Could you point out where she has been rude? I can't see it. On the contrary, she has put a great deal of effort into her contributions (although it appears that they are ones in which you are not interested).

Posted: Fri Nov 21, 2003 11:08 pm
by talleyrand
I finally had a few minutes to read and attempt to digest ramonsky's observations and let me just say my god my head hurts now!

Posted: Tue Nov 25, 2003 10:01 am
by ramonsky
Well, in reality, I don't think ANY program is going to implement all of the features listed above for a long, long time. Even UniPad (a text editor whose selling point is Unicode features) doesn't implement all of them. Some of the features I list (in particular things like locale-sensitive sorting, regular expression parsing, and so on) are probably best delegated to the operating system or an external DLL anyway (because it wouldn't be particularly good for Unicode as a whole if every single application had to re-implement these complex algorithms - that's just re-inventing the wheel). So this isn't really TextPad's business - I mentioned them only to stress that full Unicode support is hard, and that no-one should expect to see it, in this or any other application, overnight.

That said, SOME of the features I listed could be done fairly straightforwardly. Think about (9), for instance - Font selection. This would be a relatively small change which wouldn't take much to implement. (BEFORE: only one font per doc-class. AFTER: one or more fonts per doc-class). But it would achieve the ability to display any character, which would be a massive improvement. Remember - if you have conformance, you already have the ability to edit any character. If you have (9), you also have the ability to see any character, instead of merely seeing an "unknown character" symbol.

Also, given (9), it would be a simple matter to implement (10). Because once you've got the ability to display characters correctly, adding more clipbooks is almost no work at all.

Once you have (9) and (10) together, you've got pretty much everything anyone would (realistically) expect of Unicode support in a text editor - the abilities to input it, edit it, and output it.

Sure, the complicated stuff is complicated - but there's easy stuff too, and the easy stuff alone would make a world of difference.

Jill

Hurty head

Posted: Tue Nov 25, 2003 10:03 am
by schnitzi
All that made my head hurty too. It had not occurred to me that there would be so much that needs to change in order to achieve full Unicode compliance.

But, let's start the process. Full compliance would kick *ss, and expand the customer base. I vote yes.

Let's start slow

Posted: Sat Dec 13, 2003 4:18 am
by lexthang
To me I think I only need the capability to display, preserve (during editing) and save Unicode characters.

The other functionality would be nice to have, but it might be needed by only a small group of people. On the other hand, the cost of implementing it all would be too much, not to mention the avalanche of new bugs that such a big change would bring about. TextPad is commercial software, so they'll have to take that into account.

Posted: Wed Jan 07, 2004 2:12 am
by haughki
i could not agree more with ramonsky. textpad absolutely needs better unicode support. i've been using, enjoying, and recommending textpad for almost seven years, and i'm just about to jump ship because of the lack of unicode support. it's unbelievable that _notepad_ offers better unicode support.

the simple changes ramonsky outlines ([9] + [10]) would support all i need to do.

of course, i use textpad for coding as well, and i love its multi-computer language support. C++, java, python, xml, etc. all very clean and configurable, and very lightweight. yum. but, it's silly that i've had so much trouble generating a simple, editable chinese test string. i finally figured out how to do it, but it's very cumbersome and as far as i can tell, absolutely language specific. i have to handle about ten different languages in my current job. globalization happened long ago.

and ramonsky, you're not rude at all. thanks for the detail and dedication.

haughki

Need it

Posted: Sun Jan 23, 2005 11:23 am
by LuciR
I really need Unicode support in TextPad.
I have to edit multilanguage XML documents. Now I'm restricted to using other editors for this task. I'd appreciate Unicode support in TextPad even if most of the problems listed in this topic were not solved correctly. If it were possible to open, edit and save Unicode documents, it would be a nice start.
Here are my comments on some points:
(1) Case Conversion
Unicode algorithms should be implemented.
(2) Case-insensitive Comparison
Optimization is not necessary in the beginning
(4) Sorting
I think we may stick to the character-code order in the beginning.
(5) Regular Expressions
No enhancements needed in the beginning
(6) Normalization
We may use any fixed normalization for inserting new characters and any fixed search algorithm in the beginning (for example, NFD for inserting characters, and searching that matches any spelling of a glyph).
Options may be added later.
(10) Inputting Characters
I think IME is quite enough. People who need to input text in some language most likely have corresponding IME installed.
(11) Private Use Area Characters
In the beginning, it would be enough to preserve character codes during load/edit/save and show as unknown characters.
SUMMARY
The general idea is to get a possibility to edit multilanguage documents as soon as possible, even having some problems unresolved.

Posted: Wed Mar 22, 2006 3:08 am
by dburry
I need this very badly.... and I have for years, and it just keeps getting worse and worse year after year... Conformance + (9) + (10) would indeed fit everything I need for now...

How many years does it take a textpad to screw in a lightbulb???? come on guys....

Posted: Tue Apr 25, 2006 9:48 pm
by Drxenos
ramonsky wrote: C++ has the type wchar_t, and Unicode string literals (for instance L"hello world").
This is untrue. Neither the C nor the C++ standards define the encoding or character set for the wide character types or literals. It is implementation-specific. For Windows this happens to be UTF-16. Likewise, character types are required to be neither ASCII nor ANSI.

Posted: Tue May 02, 2006 2:50 pm
by leegee
[quote="dburry"]I need this very badly.... and I have for years, and it just keeps getting worse and worse year after year... Conformance + (9) + (10) would indeed fit everything I need for now...

How many years does it take a textpad to screw in a lightbulb???? come on guys....[/quote]

Seconded. This is now so important I am about to stop using Textpad after ten years of using nothing else.

Posted: Fri Jun 23, 2006 6:12 am
by I Steal Toast
Personally I'd be happy if I could open a file containing both English and Japanese characters and edit it without the Japanese text becoming "??????". Windows already handles inputting in Japanese, all Textpad needs to do is not turn it into garbage.