Unicode is a character set rather than a single encoding. Each Unicode character is assigned a number (its code point, in the range U+0000 to U+10FFFF) that can be stored in 21 bits.
There are a number of formats in which these 21-bit numbers can be stored and transmitted, including:
UTF-32BE (32 bits, big-endian)
UTF-32LE (32 bits, little-endian)
UTF-16BE (16 bits, big-endian)
UTF-16LE (16 bits, little-endian)
UTF-8 (8 bits)
In UTF-32 (BE or LE), each character is represented as a single 32-bit (4-byte) number. In UTF-16 (BE or LE), each character is represented as one or two 16-bit (2-byte) numbers. In UTF-8, each character is represented as one, two, three or four 8-bit (1-byte) numbers.
For the multi-byte representations (UTF-32 and UTF-16), the byte order for each character can be least significant byte first (LE, little-endian, Intel byte order) or most significant byte first (BE, big-endian, Sun byte order, network byte order).
For ASCII characters (whose values are in the range 0..127 and can be represented in a 7-bit number) the ASCII representation and UTF-8 representation are identical.
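To make the sizes and byte orders described above concrete, here is a small Python sketch. The sample characters and the helper name encoded_hex are my own choices for illustration; the codec names are the ones Python's standard library uses.

```python
# Encode a few sample characters in each format and show the resulting bytes.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji
ENCODINGS = ["utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8"]


def encoded_hex(ch: str, encoding: str) -> str:
    """Return the bytes of ch in the given encoding as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode(encoding))


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}")
    for enc in ENCODINGS:
        print(f"  {enc:<10} {encoded_hex(ch, enc)}")
```

For "A" this shows a single byte 41 in UTF-8 (the same as ASCII), 41 00 in UTF-16LE and 00 41 in UTF-16BE. The emoji U+1F600 lies outside the 16-bit range, so it needs two 16-bit numbers in UTF-16 and four bytes in UTF-8.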
The representation and byte order can be indicated by the presence of a byte order mark (BOM - Unicode character U+FEFF) at the beginning of the text. In the above representations the BOM is stored as:
UTF-32BE 00 00 FE FF
UTF-32LE FF FE 00 00
UTF-16BE FE FF
UTF-16LE FF FE
UTF-8 EF BB BF
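The table above can be turned into a simple BOM sniffer. This is only a sketch: the function name detect_bom and the returned labels are my own, and the BOM constants come from Python's standard codecs module. Note that the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE), so the 4-byte marks have to be tested first.

```python
import codecs

# Longest BOMs first, so that FF FE 00 00 is not mistaken for UTF-16LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file without a BOM returns None, which says nothing about its encoding either way. A UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE; that is rare in practice, but it is a limit of this kind of check.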
TextPad doesn't handle UTF-32 (BE or LE).
What TextPad (misleadingly) calls "Unicode" is UTF-16LE.
What TextPad (misleadingly) calls "Unicode (big endian)" is UTF-16BE.
What TextPad (correctly) calls "UTF-8" is UTF-8.
For Windows registry files you need UTF-16LE with BOM. If you open an existing registry file in binary format, you will see that it begins with FF FE.
To save a file in that format from TextPad, select either
Configure | Preferences | Document Classes | <Class> | Write Unicode and UTF-8 BOM
or
View | Document Properties | Preferences | Write Unicode and UTF-8 BOM
and then use
File | Save As... | Encoding: Unicode
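If you want to produce or check such a file outside TextPad, something like the following Python sketch should work. The function names and the file name example.reg are made up for illustration, and the sample text is just the usual first line of a registry export; the point is only the FF FE mark followed by UTF-16LE data.

```python
import codecs
import os
import tempfile


def write_utf16le_with_bom(path: str, text: str) -> None:
    """Write text as UTF-16LE, preceded by the FF FE byte order mark."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE)
        f.write(text.encode("utf-16-le"))


def starts_with_utf16le_bom(path: str) -> bool:
    """Check that the first two bytes of the file are FF FE."""
    with open(path, "rb") as f:
        return f.read(2) == codecs.BOM_UTF16_LE


# Hypothetical file name and sample content, written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "example.reg")
    write_utf16le_with_bom(sample_path, "Windows Registry Editor Version 5.00\r\n")
    print(starts_with_utf16le_bom(sample_path))
```

Writing the BOM and the "utf-16-le" bytes explicitly avoids depending on the byte order of the machine running the script, which is what Python's plain "utf-16" codec uses when writing.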