Unicode?

CWBillow
Posts: 110
Joined: Thu Nov 06, 2003 11:59 pm
Location: Chula Vista California

Post by CWBillow »

This may seem real stupid, but:

If I edit a reg entry in Wordpad, and save it in Unicode format -- the only Unicode choice --, all is OK.

If I want to edit it in TextPad, is it then Unicode, Unicode (big endian) -- which is what? -- or UTF-8?

What the heck are the diffs? I've gotten so screwed up a couple times that it took me forever to finally get it saved right...

Regards,
Chuck Billow
ben_josephs
Posts: 2464
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

Unicode is a character set. Each Unicode character is assigned a number (its code point) in the range 0 to 0x10FFFF, which can be stored in 21 bits.

There are a number of formats in which these 21-bit numbers can be stored and transmitted, including:

UTF-32BE (32 bits, big-endian)
UTF-32LE (32 bits, little-endian)
UTF-16BE (16 bits, big-endian)
UTF-16LE (16 bits, little-endian)
UTF-8    (8 bits)
In UTF-32 (BE or LE), each character is represented as a single 32-bit (4-byte) number. In UTF-16 (BE or LE), each character is represented as one or two 16-bit (2-byte) numbers. In UTF-8, each character is represented as one, two, three or four 8-bit (1-byte) numbers.
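To make those counts concrete, here is a small Python sketch (Python is my choice for illustration, it has nothing to do with TextPad itself). The helper name unit_counts and the four sample characters are made up for this example.

```python
def unit_counts(ch: str) -> dict:
    """Return how many code units each format needs for one character."""
    return {
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 4 bytes per unit
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "UTF-8": len(ch.encode("utf-8")),            # 1 byte per unit
    }

# U+0041 (A), U+00E9 (e acute), U+20AC (euro sign), U+1F600 (an emoji)
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", unit_counts(ch))
```

The explicit -le codec names are used so that Python does not prepend a byte order mark, which would inflate the counts.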

For the representations built from multi-byte units (UTF-32 and UTF-16), the bytes of each 32-bit or 16-bit unit can be stored least significant byte first (LE, little-endian, Intel byte order) or most significant byte first (BE, big-endian, Sun byte order, network byte order).
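As a quick illustration of the byte order difference, this Python sketch encodes one character both ways. The euro sign is just a sample I picked; the variable names are arbitrary.

```python
ch = "\u20ac"  # EURO SIGN, code point U+20AC

be16 = ch.encode("utf-16-be")  # most significant byte first
le16 = ch.encode("utf-16-le")  # least significant byte first
be32 = ch.encode("utf-32-be")
le32 = ch.encode("utf-32-le")

print("UTF-16BE", be16.hex(" "))  # 20 ac
print("UTF-16LE", le16.hex(" "))  # ac 20
print("UTF-32BE", be32.hex(" "))  # 00 00 20 ac
print("UTF-32LE", le32.hex(" "))  # ac 20 00 00
```

The same number is stored in each pair; only the order of the bytes within the unit is reversed.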

For ASCII characters (whose values are in the range 0..127 and can be represented in a 7-bit number) the ASCII representation and UTF-8 representation are identical.
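A minimal Python check of that claim, assuming nothing beyond the standard codecs. The function name is hypothetical, and the sample string is simply the first line of a typical .reg file.

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """True if text is pure ASCII and its ASCII and UTF-8 bytes match."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        return False  # text contains a character outside 0..127

header = "Windows Registry Editor Version 5.00"
print(same_bytes_in_ascii_and_utf8(header))       # True
print(same_bytes_in_ascii_and_utf8("caf\u00e9"))  # False
# The same ASCII text in UTF-16 takes two bytes per character instead.
print(len(header.encode("utf-16-le")) == 2 * len(header))  # True
```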

The representation and byte order can be indicated by the presence of a byte order mark (BOM - Unicode character U+FEFF) at the beginning of the text. In the above representations it is stored as these bytes:

UTF-32BE   00 00 FE FF   
UTF-32LE   FF FE 00 00   
UTF-16BE   FE FF         
UTF-16LE   FF FE         
UTF-8      EF BB BF      
TextPad doesn't handle UTF-32 (BE or LE).
What TextPad (misleadingly) calls Unicode is UTF-16LE.
What TextPad (misleadingly) calls Unicode (big endian) is UTF-16BE.
What TextPad (correctly) calls UTF-8 is UTF-8.
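If you want to check which of the byte order marks listed above a file starts with, something like the following Python sketch would do. The function name sniff_bom is made up; the BOM constants come from Python's standard codecs module. The UTF-32 entries are tested first because the UTF-32LE mark (FF FE 00 00) itself begins with the UTF-16LE mark (FF FE).

```python
import codecs
from typing import Optional

# Longest marks first, so FF FE 00 00 is not mistaken for FF FE.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF8, "UTF-8"),
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the format named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(bytes.fromhex("fffe57006900")))  # UTF-16LE
print(sniff_bom(b"plain ASCII text"))            # None
```

This is only a guess based on the first few bytes: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE, and a file without a BOM tells you nothing.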

For Windows registry files you need UTF-16LE with BOM. If you open an existing registry file in binary format, you will see that it begins with FF FE.
Select Configure | Preferences | Document Classes | <Class> | Write Unicode and UTF-8 BOM
or View | Document Properties | Preferences | Write Unicode and UTF-8 BOM
and use
File | Save As... | Encoding: Unicode
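For comparison, here is roughly what that save does at the byte level, sketched in Python rather than TextPad. The function name encode_reg, the key Software\ExampleKey and the value ExampleValue are all invented for illustration, and the file is written to a temporary directory.

```python
import codecs
import os
import tempfile

def encode_reg(text: str) -> bytes:
    """Encode text as UTF-16LE preceded by the FF FE byte order mark."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Hypothetical registry content, with Windows (CR LF) line endings.
sample = (
    "Windows Registry Editor Version 5.00\r\n"
    "\r\n"
    "[HKEY_CURRENT_USER\\Software\\ExampleKey]\r\n"
    '"ExampleValue"="hello"\r\n'
)

path = os.path.join(tempfile.mkdtemp(), "example.reg")
with open(path, "wb") as f:   # binary mode: no newline or encoding changes
    f.write(encode_reg(sample))

with open(path, "rb") as f:
    data = f.read()
print(data[:2].hex(" "))  # ff fe
```

The BOM is written explicitly instead of relying on Python's plain "utf-16" codec, because that codec picks the byte order of the machine it runs on.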
CWBillow
Posts: 110
Joined: Thu Nov 06, 2003 11:59 pm
Location: Chula Vista California

Post by CWBillow »

Ben:

That was not only thorough, but even for me understandable.

Thanks,
Chuck