LonelyPixel wrote: waiting for an explanation before I vote on this...
Okey Dokey, here goes...
When you click on Save As you get a dialog box. It contains a drop-down menu called "Encoding". One of these encodings is UTF-8. So far so good.
Also on the list are "Unicode" and "Unicode (big endian)". These are misnamed - Unicode is a character set, not an encoding. Ideally, TP should refer to these encodings by their correct names: UCS-2LE and UCS-2BE respectively (collectively known as UCS-2).
However, notably ABSENT from the list are UTF-16LE, UTF-16BE (collectively known as UTF-16), UTF-32LE and UTF-32BE (collectively known as UTF-32). All of these are important for saving Unicode in a file. Baraclese's suggestion may seem minor, but it's also a piece of cake to implement.
Given access to TP's source code, I could code all of these in less than ten minutes (half an hour if you want them tested). It's a trivial enhancement, and wouldn't significantly increase TP's size or reduce its efficiency.
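To give a feel for how little code is involved, here is a minimal sketch in Python. It is not TextPad's code (which I haven't seen), the function names are made up for illustration, and it leaves out the byte order mark a real Save As would probably write.

```python
import struct

def encode_utf32(text, big_endian=False):
    """Encode text as UTF-32: one 4-byte unit per codepoint, no BOM."""
    fmt = ">I" if big_endian else "<I"
    return b"".join(struct.pack(fmt, ord(ch)) for ch in text)

def encode_utf16(text, big_endian=False):
    """Encode text as UTF-16: one or two 16-bit words per codepoint, no BOM."""
    fmt = ">H" if big_endian else "<H"
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # Surrogate values are reserved and are not characters themselves.
            raise ValueError("lone surrogate cannot be encoded")
        if cp < 0x10000:
            # Codepoints up to U+FFFF fit in a single 16-bit word.
            out += struct.pack(fmt, cp)
        else:
            # Anything higher is split into a high and a low surrogate.
            cp -= 0x10000
            out += struct.pack(fmt, 0xD800 | (cp >> 10))
            out += struct.pack(fmt, 0xDC00 | (cp & 0x3FF))
    return bytes(out)
```

Calling encode_utf16(text, big_endian=True) gives UTF-16BE and the default gives UTF-16LE; likewise for encode_utf32. The results can be checked against Python's built-in "utf-16-le", "utf-16-be", "utf-32-le" and "utf-32-be" codecs.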
BUT ... since TextPad is not currently capable of storing Unicode characters which are not in the current Windows codepage, it's also an enhancement suggestion with decidedly limited usefulness.
(Technical note: TextPad also doesn't interpret UTF-8 correctly when opening a file it didn't create. This is easy to demonstrate by messing around with a binary file editor.)
You asked what UCS-4 is good for. I shall explain. UCS-2 covers only the subset of Unicode consisting of the codepoints from U+0000 to U+FFFF inclusive. Each character is saved as precisely two bytes. However, Unicode doesn't stop at U+FFFF - it goes all the way up to U+10FFFF, so all of the characters between U+010000 and U+10FFFF are as inexpressible in UCS-2 as they are in ASCII.
UCS-4, on the other hand, stores codepoints from U+00000000 to U+7FFFFFFF inclusive (it is a 31-bit code space), saved as precisely four bytes per character. Thus, it can store every Unicode character ... as well as the very, very high codepoints beyond U+10FFFF which even Unicode doesn't claim.
The UTF- formats are slightly different, in that they can all store every Unicode character. UTF-8 is a variable-byte-width encoding (each character takes 1, 2, 3 or 4 bytes), and UTF-16 is a variable-word-width encoding (each character takes 1 or 2 16-bit words). UTF-32 is effectively the same as UCS-4 except that codepoints above U+10FFFF are illegal.
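The width differences are easy to see for yourself. Here is a small sketch using Python's built-in codecs; the helper names encoded_sizes and fits_in_ucs2 are my own, and since Python has no UCS-2 codec the second one is just a range check.

```python
def encoded_sizes(ch):
    """Return how many bytes one character takes in each UTF encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

def fits_in_ucs2(ch):
    """UCS-2 can only express codepoints up to U+FFFF."""
    return ord(ch) <= 0xFFFF

# One sample per UTF-8 width: 1, 2, 3 and 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch), "UCS-2 ok:", fits_in_ucs2(ch))
```

The last sample, U+1D11E (a musical G clef), is the interesting one: it needs four bytes in every UTF encoding, two 16-bit words in UTF-16, and cannot be written in UCS-2 at all.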
I suggest that people vote for this because, as I said, it's a trivial enhancement, and the only difference you'd notice is a few extra choices on one particular pulldown menu.
However, I'd also like to point you in the direction of a more significant poll ... Unicode Conformance ... which is an important and non-trivial enhancement request. And, in point of fact, the suggestion of THIS thread is going to be pretty useless unless we get proper Unicode support (as in, not destroying characters) first.