What Unicode Conformance Isn't
Posted: Fri Nov 21, 2003 12:46 pm
In this thread I am going to list a number of algorithms and possible TextPad enhancements which go way beyond mere Unicode "Conformance". These are the kinds of things that people expect from Unicode - the ultimate promise which, sadly, is rarely delivered. I mention them here because I want people to understand that adding full Unicode support is hard, and nobody should expect it overnight. But if you want to see any or all of these features, you must vote for Unicode Conformance FIRST, because without that, none of this can ever happen.
(1) Case Conversion
From the Edit menu, you can click on Change Case, then you have five different options. But none of these case-change algorithms are true Unicode algorithms. For example, in Unicode, if you highlight the characters "aßa" and click on Edit / Change Case / Upper Case, they should change to "ASSA", because the single character 'ß' changes to the two characters 'SS' when uppercased. A true Unicode application should apply the Unicode ToUppercase algorithm (also ToLowercase and ToTitlecase), as documented in the Unicode Standard.
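To give a feel for what this means in practice, here is a tiny sketch in Python 3, whose built-in str.upper() happens to implement the Unicode full case mappings (TextPad would of course need to do this in its own code):

    text = "aßa"
    # Unicode full uppercasing maps the single character 'ß' to the two characters "SS",
    # so the result can be LONGER than the input - a simple per-character table isn't enough.
    print(text.upper())                   # ASSA
    print(len(text), len(text.upper()))   # 3 4

The point to take away is that case conversion can change the length of the text, which a naive one-character-in, one-character-out implementation cannot cope with.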
(2) Case-insensitive Comparison
Similarly, if you click on Search / Find (to get the find dialog) and leave the Match Case checkbox unchecked, we should ideally be doing a Unicode case-insensitive comparison. This means that a search for "assa" should match with "aßa". To achieve this, you need to apply the Unicode Casefolding algorithm. You also need to do some optimisation, because the Unicode Casefold Algorithm is presented in very non-optimal terms. (If you want to compare "ANT" with "elephant", for example, the casefolding algorithm says you should casefold both words, to give "ant" and "elephant", and then compare them, but an optimised implementation only needs to convert the first character!).
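Again as a sketch (Python 3's str.casefold() implements Unicode case folding, which is the operation a case-insensitive match needs; plain lowercasing is not enough):

    needle, text = "assa", "aßa"
    # Case folding maps both sides onto a common form before comparing.
    print(needle.casefold() == text.casefold())   # True  - 'ß' folds to "ss"
    print(needle.lower() == text.lower())         # False - lowercasing leaves 'ß' alone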
(3) Right-to-left Text
Not all text in the world flows from left to right. Some Unicode characters are expected to flow from right to left by default. What's more, you can embed right-to-left text inside left-to-right text, and vice versa. Some characters (for example open and close brackets) are expected to be mirror-reversed when rendered from right-to-left. The whole thing is defined by the Unicode Bidirectional Algorithm. This would be additional functionality for TextPad. In addition to acknowledging the default directionality of Unicode characters, it must also be possible for users to select a default direction for a document, or for a selected region.
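The full Bidirectional Algorithm is well beyond a few lines, but the raw material it works from - each character's directionality class, and whether it should be mirrored - is easy to inspect. A small sketch in Python 3 using the standard unicodedata module:

    import unicodedata

    # 'L' = left-to-right, 'R' = right-to-left, 'AL' = Arabic letter, 'ON' = other neutral.
    for ch in ["a", "\u05D0", "\u0627", "("]:   # Latin a, Hebrew alef, Arabic alef, open bracket
        print("U+%04X" % ord(ch),
              unicodedata.bidirectional(ch),    # directionality class
              unicodedata.mirrored(ch))         # 1 if the glyph should be mirror-reversed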
(4) Sorting
Highlight some text, click on Tools / Sort. TextPad currently only offers you a lexicographical sort, which means (for example) that the following sequence is considered to be correctly sorted: 'E', 'F', 'e', 'f', 'É', 'é'. In fact, TextPad is sorting on the basis of the character code of each character, not on the basis of local collation criteria. It gets even worse if you ask for a case-insensitive sort: our example then ends up as 'e', 'f', 'é', with 'é' still sorted after 'f'. This is not good. The correct way to do this is by using the Unicode Collation Algorithm. (This was only published literally a few weeks ago, so I don't think ANY application fully supports it yet). To implement this in TextPad, the Sort dialog would need something of an overhaul. You would need to be able to specify things like whether or not accents are considered significant, whether uppercase sorts before or after lowercase, and so on. (Remember, Unicode sorting has NOTHING to do with codepoint order). All of this is documented in the UCA, and appropriate defaults could easily be set for each locale (for instance the defaults for French would be different from the defaults for English).
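As a rough illustration of the difference between codepoint order and locale-aware collation, here is a Python 3 sketch using the standard locale module (this assumes a French UTF-8 locale is actually installed on the machine; a full UCA implementation, for example via ICU, would give the finer control described above):

    import locale

    words = ['f', 'e', 'É', 'F', 'é', 'E']
    print(sorted(words))   # codepoint order: ['E', 'F', 'e', 'f', 'É', 'é'] - accents pushed to the end

    # Locale-aware collation keeps 'é' next to 'e' instead of after 'f'.
    locale.setlocale(locale.LC_COLLATE, 'fr_FR.UTF-8')
    print(sorted(words, key=locale.strxfrm))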
(5) Regular Expressions
Regular expressions take on a whole new dimension in Unicode. Expressions like "[[:alpha:]]" must now match Greek and Chinese letters as well as just Latin ones. Word characters ("\w" and "\W") and word boundaries are harder to calculate. You must be able to use the "\uHHHH" and "\U00HHHHHH" escapes in regular expressions. Unicode also has EXTRA features which you must be able to specify in regular expressions - if you want to match only against currency symbols you could specify "\p{Sc}"; if you want to match only against characters in the Greek script you could specify "\p{Greek}", and so on. All of this is documented in Unicode Technical Report #18. This would not be a minor enhancement, it would be a PHENOMENAL enhancement.
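A sketch of what this looks like in practice, using Python's third-party regex module (chosen only because, unlike the standard re module, it supports the \p{...} property syntax):

    import regex   # third-party module: pip install regex

    print(regex.findall(r'\p{Sc}', 'price: 5€, £3, $10'))      # ['€', '£', '$'] - currency symbols
    print(regex.findall(r'\p{Greek}+', 'mix of αβγ and abc'))  # ['αβγ'] - Greek-script runs only
    print(regex.findall(r'\u03b1', 'αβγ'))                     # ['α'] - \uHHHH escape in a pattern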
(6) Normalisation
In Unicode, there is a distinction between "characters" and "glyphs". A glyph is what gets displayed; a character is what gets stored internally. Confusingly, there is more than one way of "spelling" a glyph. For example, the glyph 'é' can be spelt as either the single character U+00E9 (e acute), or as the two-character sequence U+0065 (e) followed by U+0301 (combining acute). If, within a given selection, all of your glyphs are stored CONSISTENTLY, then the text is said to be "normalised". If everything is in its composed (shortest) form, it's called "Normalisation Form C" (NFC). If everything is in its decomposed (longest) form, it's called "Normalisation Form D" (NFD). NFC uses less space, but NFD makes most of the algorithms listed above go a lot faster.
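A short Python 3 sketch of the two canonical forms, using the standard unicodedata module:

    import unicodedata

    nfc = '\u00E9'     # 'é' stored as the single precomposed character
    nfd = 'e\u0301'    # 'é' stored as 'e' followed by a combining acute
    print(nfc == nfd)                                  # False - different storage, same glyph
    print(unicodedata.normalize('NFC', nfd) == nfc)    # True
    print(unicodedata.normalize('NFD', nfc) == nfd)    # True
    print(len(nfc), len(nfd))                          # 1 2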
The implications of this for a text editor are astounding. Text documents may legitimately contain a mixture of NFC, NFD and non-normalised text, and TextPad should not modify any of it without the explicit consent of the user. But on the Edit menu there should be the possibility of converting selected text (or the whole document) to either NFC or NFD. Then you have other decisions to make, such as, if you press the cursor-right button to step over 'é' spelt U+0065 U+0301, should it be possible to position the cursor between the e and the acute? Should it be possible to delete the acute accent without deleting the e? A common sense approach would be to say no, to treat every glyph as indivisible. And as for inserting new characters by typing them, should they be entered into a document in NFC or NFD? This should be an editable preference setting.
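For the "treat every glyph as indivisible" approach, the unit of cursor movement would be what Unicode calls a grapheme cluster. A sketch using the third-party regex module, whose \X pattern matches one grapheme cluster (the cluster boundaries themselves are defined by the Unicode segmentation rules):

    import regex   # third-party module; \X matches one extended grapheme cluster

    text = 'cafe\u0301'                 # "café" with the acute stored as a combining character
    print(len(text))                    # 5 code points
    print(regex.findall(r'\X', text))   # ['c', 'a', 'f', 'é'] - 4 user-perceived glyphs

Cursor-right would then step over one item of the second list, not one item of the first.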
It gets worse. Suppose you do a search for 'é'. Should it match U+00E9? Or U+0065 U+0301? Or both? In most normal circumstances, you'd want it to match either, but there may be special circumstances in which you'd like to search for a particular spelling of a glyph, so there would have to be a(nother) checkbox on the Search dialog to accommodate this.
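A minimal sketch of the "match either spelling" behaviour: normalise both the search string and the document text to the same form before comparing. (Mapping the match position back onto the original, un-normalised buffer is the genuinely fiddly part for an editor, and is not shown here.)

    import unicodedata

    def equivalent_find(haystack: str, needle: str) -> bool:
        # Normalise both sides to NFC so that U+00E9 and U+0065 U+0301 compare equal.
        return unicodedata.normalize('NFC', needle) in unicodedata.normalize('NFC', haystack)

    print(equivalent_find('caf\u00E9', 'e\u0301'))   # True - finds the precomposed 'é'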
(7) Compatibility Equivalence
There is another form of equivalence, called "compatibility" equivalence. In simple terms, one Unicode character is compatible with another (or with a sequence of others) if they look vaguely the same. (I've oversimplified I know - please don't pull me up on this). So, for example, the Unicode Trademark character is equivalent to "TM".
For this reason, there must be some item on the Edit menu which makes it possible to highlight a section of text and convert it to "compatibility" form. (Strictly speaking, there are two compatibility forms, called NFKC (the shortest encoding) and NFKD (the longest encoding), but it would suffice to let TextPad choose, on the basis of the user's preference between NFC and NFD). For searching, there should be yet another extra checkbox on the Search dialog - ON for a compatibility equivalent match, OFF for a normal match.
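The compatibility forms are also exposed by Python's standard unicodedata module, which makes the Trademark example easy to demonstrate:

    import unicodedata

    print(unicodedata.normalize('NFKC', '\u2122'))   # 'TM' - the trademark sign decomposes
    print(unicodedata.normalize('NFKC', '\uFB01'))   # 'fi' - so does the 'fi' ligature
    print(unicodedata.normalize('NFC',  '\u2122'))   # '™'  - the canonical forms leave it alone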
(8) Variable Width Characters - EVEN WITH A MONOSPACED FONT
When using a proportional font, we expect all characters to have different widths, so TAB stops are expressed in terms of pixels, but when using a fixed-width font, we expect all characters to be the same width, hence TAB stops are expressed in terms of character positions. Ok - let's use Unicode terminology here, "glyph positions". We imagine a text document to consist of a rectangular array of cells, with each glyph occupying exactly one cell (apart from TAB, which has to occupy more than one cell in order to do its job). But some Unicode glyphs are just too damn big to fit in a single cell, so you're stuck with a choice between using bigger cells, or letting some characters occupy more than one cell. Well, the latter approach won the day, so some characters, Han ideographs for instance, will occupy two or more successive character cells.
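One property an editor could use to decide how many cells a character needs is the East Asian Width property - a simplified sketch in Python 3 (real behaviour also depends on the font, so treat the two-cell rule below as an approximation):

    import unicodedata

    def cells(ch: str) -> int:
        # 'W' (wide) and 'F' (fullwidth) characters get two cells; everything else gets one.
        return 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1

    for ch in ['a', '\u6F22', '\uFF21']:   # 'a', the Han character '漢', fullwidth 'A'
        print(ch, cells(ch))               # 1, 2, 2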
So when you're moving the cursor around the document, the behavior in the presence of wide characters should depend on the existing preference setting: Configure / Preferences / Editor / Constrain the cursor to the text. If constrained, the cursor may never end up inside a wide character. In this sense, they would act a little like TAB (though fixed width).
(9) Font Selection
Displaying a character is harder in Unicode. In current versions of TextPad, you configure a SINGLE font for each document class. This won't do for Unicode. No single font contains every single glyph. A new approach is needed. Instead, for each document class, you must select a SEQUENCE of fonts. Every time TextPad needs to display a glyph, it will try each font in turn, trying to find an image it can display. Eventually, if it runs out of fonts and still hasn't found the image, it may display an "unsupported character" glyph.
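A very simplified model of that fallback chain, in Python 3. The coverage table here is completely made up (a real implementation would read each font's character map), and the font names are only placeholders:

    # Hypothetical coverage table: font name -> predicate over codepoints it can render.
    FALLBACK_CHAIN = [
        ('Courier New', lambda cp: cp < 0x0500),
        ('MS Gothic',   lambda cp: 0x3000 <= cp <= 0x9FFF),
    ]

    def font_for(ch: str) -> str:
        cp = ord(ch)
        for name, covers in FALLBACK_CHAIN:
            if covers(cp):
                return name                          # first font that claims the glyph wins
        return '<unsupported-character glyph>'       # ran out of fonts

    for ch in 'aé漢☃':
        print(ch, '->', font_for(ch))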
Doing this on a per-document-class basis is sensible, but configuring it through the preferences dialog could be tedious. Other approaches may have to be considered, such as a per-document-class configuration file (like you have for syntax coloring).
(10) Inputting Characters
Microsoft does a reasonable job of Unicode input with its IMEs (Input Method Editors), but the problem with them is that in order to input an arbitrary Unicode character, you'd have to have the right IME installed. For most people, it would be better to have some other, more general method. TextPad already has clipbooks, of course, so it would be a relatively simple matter to add a few more books to the clip library, covering every Unicode script. In addition, there should be a way of entering an arbitrary Unicode character given its codepoint in hexadecimal.
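The hexadecimal-entry part is trivial in any language; a Python 3 sketch:

    def char_from_hex(hex_str: str) -> str:
        # Turn a codepoint typed as hex into the corresponding character.
        return chr(int(hex_str, 16))

    print(char_from_hex('00E9'))    # é
    print(char_from_hex('1D11E'))   # 𝄞  (musical symbol G clef - outside the Basic Multilingual Plane)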
(11) Private Use Area Characters
This is the bit where you get to write in Klingon. Private Use characters are characters which the Unicode Consortium reserves for private use by consenting parties. So if I decide that a particular PUA character is going to represent (say) a silhouette of the starship Enterprise, and you are happy to go along with that, then I can write documents containing this character, and you can read them. (Nobody else will be able to read them though). In order to use the PUA, I have to provide three things: (1) fonts which allow my characters to be rendered, (2) some sort of input method, for example a clip book, and (3) some sort of table of character properties, so that all of the above algorithms can operate on my characters.
Strange as it may seem, some PUA conventions are already arising. One popular PUA "set" is known as CSUR, and contains, among other things, the Klingon alphabet. So, if you had a "CSUR drop-in" (containing the CSUR fonts, clip book and character properties table), you could type documents in Klingon, send them to other people, and have them correctly understood.
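For what it's worth, the standard properties tell you nothing useful about a PUA character, which is exactly why a drop-in would need to ship its own property table. A sketch (the property values below are invented purely for illustration - the real assignments would come from the CSUR documentation):

    import unicodedata

    ch = '\uF8D0'                    # a codepoint inside the BMP Private Use Area (U+E000..U+F8FF)
    print(unicodedata.category(ch))  # 'Co' - all Unicode itself will say is "private use"

    # A drop-in would have to supply the properties the algorithms above rely on, e.g.:
    PUA_PROPERTIES = {
        0xF8D0: {'category': 'Lo', 'script': 'Klingon', 'bidi': 'L'},   # invented values
    }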
CONCLUSION
If you want any or all of these enhancements in TextPad, the FIRST STEP is Unicode Conformance. That must come first. Full Unicode support is something too big to be rushed. It must be done slowly, carefully, one step at a time. You want this? Or even some of it? Vote for Unicode Conformance. That won't get you everything, but it's where everything starts.
Jill