Typing UNICODE

mo · Post by mo » Sun Jun 20, 2004 4:23 am

OK I'm ready for the ribbing this will likely generate...how do you type a UNICODE character?

I have a custom document type set up that uses a UNICODE font and has the UNICODE parameters checked.

I can see the correct characters when I type in the "escape code", i.e. ṃ (that's &#7747;), save the file as html and view it in a browser, but what do I type to get the encoded character to display in the source (in TextPad)? I have tried Alt+07747 and I get another character altogether. I have found a couple of the characters I need just by random typing, but nothing relates to the escape characters that I can see. Can anyone give me a hint?

Using Windows 2000Pro

ben_josephs · Post by **ben_josephs** » Sun Jun 20, 2004 7:13 am

Remember that TextPad is not a Unicode editor. It converts all characters internally to a character set that depends on the font and script of the document you are editing, and you are severely restricted in the range of characters you can use. Which font and script are you using?

What is an "escape code"? &#7747; (decimal) is U+1E43 (hexadecimal), which is LATIN SMALL LETTER M WITH DOT BELOW.
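The decimal-to-hexadecimal relationship above can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are made up for the example, and it assumes a Python 3 interpreter is available.

```python
import unicodedata

# The decimal number taken from the reference &#7747;
code_point = 7747
char = chr(code_point)

hex_form = f"U+{code_point:04X}"      # hexadecimal code point notation
char_name = unicodedata.name(char)    # official Unicode character name

print(hex_form)   # U+1E43
print(char_name)  # LATIN SMALL LETTER M WITH DOT BELOW
```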

For Unicode editing I use Unipad (http://www.unipad.org/).

mo · Post by mo » Sun Jun 20, 2004 12:37 pm

bj,

Remember that TextPad is not a Unicode editor. It converts all characters internally to a character set that depends on the font and script of the document you are editing, and you are severely restricted in the range of characters you can use. Which font and script are you using?

What is an "escape code"? &#7747 (decimal) is U+1E43 (hexadecimal), which is LATIN SMALL LETTER M WITH DOT BELOW.

Thank you for this response. I am pretty much an absolute beginner at using UNICODE so terms may not be correct. What I called "escape characters" are your &#7747 or the hex number. Latin small letter m with underdot is what I want.

The font I have assigned is called Pali Times and it is supposed to be a UNICODE font. (I am sure it is as it displays properly in html when I use the decimal in the source). Also the source was reliable, I checked it out in Fontlab, and a couple of other UNICODE fonts give me the same results when I type.

The problem is we are working on a search tool that will search UNICODE and I need to provide some sample pages with the encoded characters and I do not know how to get them into the source.

I do not know what you mean by "script".

Because I need to keep working in non-unicode with html files I set up a special document type called "UNICODE" with the extension "*.uni", that uses html syntax highlighting and has the two UNICODE preferences checked, and has the Pali Times font assigned. I save an html file as .uni, open it in TEXTPAD to convert the characters to unicode.

Maybe I need a UNICODE editor? Any suggestions? edit: I see your UNIPAD suggestion which I am about to try out. I also downloaded one called UNIRED which seems simple but competent.

ben_josephs · Post by **ben_josephs** » Sun Jun 20, 2004 7:39 pm

I do not know what you mean by "script".

View | Document Properties | Font | Script

But, even when using a font (Arial Unicode MS) that contains the character U+1E43, I can't find a script in which it's displayed correctly.

I do not fully understand what you are trying to do. But it seems that you want to be able to display all the characters in a Unicode font. TextPad cannot do this. I've already mentioned Unipad (http://www.unipad.org/), but this uses only its built-in font.

mo · Post by mo » Sun Jun 20, 2004 8:15 pm

bj,

Thanks again. I downloaded the trial version of UNIPAD.

The fact that I cannot use a custom font is a problem which makes this tool not the one. In reading the help files however I see that this business of entering UNICODE characters is not a simple thing.

How is it possible that these guys have gotten this far in terms of pushing this standard without a way to easily insert the characters! Going to a character selection display and clicking a character to insert it will not do!

Anyway, what it appears that I may be required to do is to create the files in TEXTPAD (using what this guy calls the "escape code" characters -- he has a long explanation of why he thinks that is a good term for these characters that are otherwise known as "numeric character references or symbolic character references"), save them as htm, then open them in UniRed and save them with that program in my custom font. I'm not even sure that will work. This is some kind of nightmare development situation. I am working in Windows, the developer is working in Mac and we are trying to create a UNICODE search engine.

I have gained a little further information: WordPad accepts input of Unicode characters in the form of Alt+07747.

So clearly that is what is needed in TEXTPAD.

boldan · Post by **boldan** » Mon Jun 28, 2004 3:51 pm

Having just finished a clip library for Excel functions (hint, hint), I suggest creating a clip library for entering Unicode codes. They will NOT display as Unicode characters, but you'll be able to enter them with TextPad and see them with an HTML editor.

mo · Post by mo » Mon Jun 28, 2004 6:16 pm

Thanks boldan,

I have a good way to enter the Numeric Character Encodings -- I use a macro keypad.

The issue, aside from the editors themselves, seems to be the need in this case to mix what TextPad calls scripts. I need to be able to see the characters from language groups that do not go together.

This is something that UNICODE doesn't seem to address, but it comes up in many situations where language is the subject, as opposed to being the way some other subject is discussed. It is going to have to be understood and accommodated somehow.

Meanwhile the solution seems to be some sort of pre-processing into an ASCII convention (e.g. lc m underdot displayed as ampersand#7747; = .m) during indexing, and for the search, then conversion back for the results.
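A rough sketch of that round trip in Python, assuming the ASCII convention is the decimal numeric character reference form (the ".m" style shorthand is not covered here). The function names `to_ascii_refs` and `from_ascii_refs` are hypothetical, chosen only for this example.

```python
import html

def to_ascii_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference like &#7747;."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

def from_ascii_refs(text: str) -> str:
    """Turn character references back into the characters they stand for."""
    return html.unescape(text)

sample = "sa\u1e43s\u0101ra"          # a word containing m with dot below and a with macron
ascii_form = to_ascii_refs(sample)    # pure ASCII, safe for a non-Unicode indexer
restored = from_ascii_refs(ascii_form)

print(ascii_form)  # sa&#7747;s&#257;ra
```

One caveat with this sketch: `html.unescape` also converts named references such as `&amp;`, so text that contains literal ampersands would need extra care before a real search tool relied on it.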

mjb71 · Post by **mjb71** » Tue Jul 13, 2004 11:56 am

You need some help understanding some terminology here.

1. character - a unit in a writing system. An example of a character is the concept of "Latin capital letter A", not an actual instance of it like the one you see here --> A. Other kinds of characters include digits, punctuation, diacritics.

2. script - a writing system. In order to write English, French, German, or pretty much any other Western or Central European language (+ a few others), one uses the Latin (a.k.a. "Roman") script. You can think of a script as just an alphabet, but it contains more than just letters and is more than just a set of characters. Other scripts include Cyrillic (used for Eastern European and Russian languages), Arabic, Indic, and various others. They all encompass not only different 'repertoires' of characters, but also rules for how to write them. For example in English we have a certain alphabet (plus certain numeric digits, punctuation characters, and less common diacritics). We also capitalize the beginning of sentences, proper nouns, personal titles, etc., and we use a period followed by a bit of white space to represent a 'full stop' at the end of a sentence, and we use commas in certain ways, we arrange everything left-to-right in rows starting at the top, etc. To the extent that that stuff is used in writing the language, it is all part of the Latin script. When writing French there are a few more alphabetic characters and many more diacritics, but otherwise it's very much the same and is derived from Latin as well, so it's part of the Latin script, too. Got it?

3. Unicode - a mapping of all known characters used in every human language to non-negative integer numbers, thus allowing any character to be represented by its number (its 'code point'). The code points run from 0 to 1,114,111. There are unique characters assigned to about 100,000 of those, mostly in the lowest part of that range. In prose like this paragraph, there is a notation that you can use to talk about any character via its code point: U+nnnn, where nnnn is the code point, in hexadecimal (base-16), padded to 4 digits wide when the code point is between 0 and 65,535 (FFFF in hexadecimal), and not padded when it's beyond that range.
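As a small illustration of the U+nnnn notation and its padding rule, here is a Python sketch. The helper name `u_plus` is invented for this example.

```python
def u_plus(char: str) -> str:
    """Format a single character's code point as U+nnnn, padded to at least 4 hex digits."""
    return f"U+{ord(char):04X}"

# Below 65,536 the number is padded to four digits; above, it is left as-is.
low = u_plus("A")            # code point 65
mid = u_plus("\u1e43")       # code point 7747
high = u_plus("\U0001F600")  # a code point beyond FFFF

print(low, mid, high)  # U+0041 U+1E43 U+1F600
print(0x10FFFF)        # 1114111, the highest code point
```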

4. numeric character reference - in a markup language like HTML, XML, or their common source, SGML, a document consists of a series of characters. Some sequences of characters within the document have special meaning as markup: for example the less-than character ("<", U+003C) introduces a 'tag', while the ampersand character followed by the number sign ("&#") introduces a numeric character reference, which is just an alternative way of entering a character. NCRs have the form &#decimalUnicodeCodepoint; or &#xhexadecimalUnicodeCodepoint; -- so, for example, U+00A0 (the 'no-break space' character) can be represented as either &#160; or &#xA0; (case-insensitive). On top of that, the markup languages have a notion of named 'entities', which are essentially aliases for sequences of 1 or more characters. Some entities are built-in and represent single characters: the entity named 'nbsp' exists in HTML, for example, and is in fact an alias for &#160; which in turn is an alias for the single no-break space character. As you might've guessed, &nbsp; is an entity reference, specifically a character entity reference because it is an entity that represents one character. Remember: if you see '&#' it is NEVER the beginning of an 'entity reference'. It is ALWAYS a 'numeric character reference'.
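To see that the decimal reference, the hexadecimal reference, and the named entity all stand for the same character, one can decode them with Python's standard `html` module. This is just a quick check, not part of any markup toolchain; the variable names are illustrative.

```python
import html

decimal_ref = "&#160;"   # decimal numeric character reference
hex_ref = "&#xA0;"       # hexadecimal numeric character reference
entity_ref = "&nbsp;"    # named character entity reference

decoded = [html.unescape(ref) for ref in (decimal_ref, hex_ref, entity_ref)]

# All three decode to the single no-break space character, U+00A0.
print([f"U+{ord(c):04X}" for c in decoded])  # ['U+00A0', 'U+00A0', 'U+00A0']
```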

4. byte - 8 bits; an 8-digit-wide binary number. See, we humans like to count with a 10-digit (base-10 or "decimal") system: we use digits 0 through 9, and the least-significant digit comes last. 15 means fifteen because it's a 1 in the more significant 'tens' column and a 5 in the less significant 'ones' column. Computers like to use a 2-digit (base-2 or "binary") system: using only digits 0 and 1, but otherwise the same. So on the right there's a 'ones' column, then a 'twos' column, 'fours', 'eights', 'sixteens', and so on up to 'hundred twenty-eights'. A binary number that is 8 columns wide and filled with 1s (11111111) represents the value two hundred fifty-five. An 8-bit number (a 'byte') can therefore represent 256 values (0 through 255, in decimal).
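The column arithmetic can be verified in a couple of lines of Python; the names below are only for illustration.

```python
# Column weights of an 8-bit number, from most to least significant.
weights = [2 ** i for i in range(7, -1, -1)]   # [128, 64, 32, 16, 8, 4, 2, 1]

all_ones = int("11111111", 2)     # eight columns, all filled with 1
fifteen_bits = format(15, "08b")  # fifteen written as an 8-digit binary number

print(weights)
print(all_ones)      # 255
print(fifteen_bits)  # 00001111
print(2 ** 8)        # 256 distinct values, 0 through 255
```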

5. glyph - a visual representation of a character. Since some characters are represented in different ways when they are used in different combinations or in different writing systems, an operating system has to be smart about how it makes use of glyphs. It's also hard to show an example glyph for each character since not all characters are graphic. Here are 3 glyphs for Latin capital letter A: A A A.

6. font - the bridge between glyphs and characters in an OS. It is a collection of glyphs, or instructions for rendering them, in a (hopefully) uniform style. Fonts nowadays are "Unicode" in the sense that they map Unicode code points to glyphs. The OS looks in the font for the code point of what character it wants to render, and gets the basic instructions for how to render it. The OS typically has to know more about how to use those instructions, as I mentioned above. Also, how many code points and glyphs are in a font is another matter. Typically, a font only contains glyphs for most of the characters needed by a very small number of scripts. You can't write Japanese with Times New Roman, for instance; the font just doesn't contain those characters. Generally when someone says a font is "Unicode" they really mean that it contains characters for many scripts, including both European and East Asian scripts, but maybe (likely) not the whole of Unicode.

7. OK, now you have almost all the info you need. The last thing to understand is that when text is written to disk or transmitted over a network, it is always encoded. Encoding, in simplest terms, means the conversion of characters to bytes. There are many encoding schemes out there, and they are often called "character sets", although the terminology preferred by Unicode is more precise. Some examples of encodings / character sets: "iso-8859-1" maps U+0000 through U+00FF to bytes 0 through FF (hex), just one byte per character, a total of 256 characters being represented (so not very much of the full extent of Unicode can be represented with it!); "windows-1252" is the same, but bytes 80-9F (hex) correspond to characters from some of the upper reaches of Unicode, e.g. the Euro symbol, left and right double quotation marks, the TM trademark symbol, etc. There are only a few encodings that handle all of Unicode (U+0000 through U+10FFFF). The most common are UTF-8 and UTF-16. UTF-8 maps each code point / character to a sequence of 1 to 4 bytes in a fixed order. UTF-16 maps each code point / character to a sequence of, essentially, either 2 or 4 bytes in a platform-dependent order (i386: little-endian, SPARC: big-endian). UTF-8 and UTF-16 also optionally include some preface bytes called a signature or byte-order mark. Fun!
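The differences between these encodings show up clearly when one character is encoded several ways. The Python sketch below uses U+1E43 (the m with dot below from earlier in the thread) and the Euro sign; the variable names are made up for the example, and the bytes are shown as hexadecimal strings.

```python
m_dot = "\u1e43"   # LATIN SMALL LETTER M WITH DOT BELOW
euro = "\u20ac"    # EURO SIGN

utf8 = m_dot.encode("utf-8").hex()          # three bytes in a fixed order
utf16_le = m_dot.encode("utf-16-le").hex()  # two bytes, little-endian
utf16_be = m_dot.encode("utf-16-be").hex()  # the same two bytes, big-endian
utf8_sig = m_dot.encode("utf-8-sig").hex()  # UTF-8 with the optional signature in front
euro_1252 = euro.encode("cp1252").hex()     # one byte in the 80-9F range

print(utf8, utf16_le, utf16_be)  # e1b983 431e 1e43
print(utf8_sig)                  # efbbbfe1b983
print(euro_1252)                 # 80

# iso-8859-1 only covers U+0000 through U+00FF, so this character cannot be encoded.
try:
    m_dot.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```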

Your OS and filesystem decide what encoding gets used by default, usually. On English versions of Windows, if you haven't messed with your settings too much, you are probably using windows-1252 as the default. This is normally transparent to you because you're usually only writing things in English and rarely need characters outside the range supported by windows-1252.

Now for the heartbreaker: TextPad is not a "Unicode" application. Specifically, it relies on the Windows APIs that work with the Windows 9x family, rather than with the newer ones that are used in NT, 2000, XP, and Server 2003. The function libraries for rendering text on these systems would let it work with something other than windows-1252, at the expense of not working at all on Windows 95, 98, or Me.

TextPad is actually smart enough to let you load and save files in different encodings, but the editor itself is only letting you work with individual bytes, which are shown to you as characters after having been decoded according to windows-1252. It simply cannot let you type a character that is outside of windows-1252 (it is Windows that eats it, not TextPad), and when it loads such characters from a file, it has to convert their bytes to Windows-1252 characters, essentially.

Yes, maybe the font you're using in the editor window supports Japanese characters, but that makes no difference when TextPad uses the non-Unicode APIs. TextPad just pumps bytes to Windows functions that convert them to characters according to windows-1252 and render them accordingly. TextPad never gets the chance to say "these bytes are UTF-8" or "please render character U+EA42".
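The general effect of decoding UTF-8 bytes as windows-1252 can be reproduced in Python. This is only an illustration of the byte-level mismatch described above, not a claim about how TextPad is implemented internally; the variable names are hypothetical.

```python
# The three UTF-8 bytes for U+1E43 (m with dot below).
utf8_bytes = "\u1e43".encode("utf-8")

# Read those bytes as if each one were a windows-1252 character.
misread = utf8_bytes.decode("cp1252")

# Going the other way: windows-1252 has no byte for U+1E43 at all.
unrepresentable = "\u1e43".encode("cp1252", errors="replace")

print(misread)          # á¹ƒ  (three unrelated characters instead of one)
print(unrepresentable)  # b'?'
```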

If you look in the Feature Request forum, you'll see that full Unicode support has been requested many times before and is pretty high up on the list of most-desired features. Go cast your vote!
