Bug - file size mis-reported for Unicode

General questions about using TextPad

Rick Jones
Posts: 8
Joined: Wed Jun 25, 2003 8:52 am


Post by Rick Jones »

Is this the right place to report a specific bug? I couldn't see a "bugs" section.

Anyway, I've just noticed that the file properties dialog shows the wrong file size for Unicode files. It's exactly half the actual size, i.e. TextPad is counting the characters correctly, but not allowing for the fact that Unicode characters are 2 bytes, not 1.

So the character count shows correctly, but the filesize should give the size in bytes, which it doesn't.

Spotted it in 4.6, still there in 4.6.2.
Rick Jones
MudGuard
Posts: 1295
Joined: Sun Mar 02, 2003 10:15 pm
Location: Munich, Germany

Post by MudGuard »

How many bytes a Unicode character requires depends on the encoding.

As far as I know, TextPad does NOT support UTF-16 (which would indeed use 2 bytes for most characters).

As far as I know, TextPad does have (limited) support for UTF-8, which uses 1 byte for every character of the (7-bit) ASCII set and 2 or more bytes for all other characters.
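To make the per-encoding difference concrete, here is a minimal Python sketch (an illustration only, not anything TextPad itself does) that counts how many bytes the same text occupies in UTF-8 and in UTF-16. The sample strings and the helper name `encoded_sizes` are made up for this example.

```python
def encoded_sizes(text):
    """Return the character count and the byte count for two encodings."""
    return {
        "chars": len(text),
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" writes no byte order mark, so this is payload only.
        "utf-16": len(text.encode("utf-16-le")),
    }

# Pure ASCII text: 1 byte per character in UTF-8, 2 in UTF-16.
ascii_sizes = encoded_sizes("TextPad")

# "Gruesse, " written with u-umlaut and sharp s, plus two Japanese characters.
mixed_sizes = encoded_sizes("Gr\u00fc\u00dfe, \u65e5\u672c")

print(ascii_sizes)
print(mixed_sizes)
```

For the ASCII string the UTF-16 size is exactly twice the character count, which matches the "half the actual size" symptom; for the mixed string the UTF-8 size falls somewhere in between.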

As for bug reporting: the feedback form under "Support" - "Feedback" would be the right place. But as I see it, this is not a bug but a misunderstanding on your side concerning Unicode and Unicode encodings.
Rick Jones
Posts: 8
Joined: Wed Jun 25, 2003 8:52 am

Post by Rick Jones »

I beg to differ: TP does support UTF-16.

I noticed the problem not because I was looking at encoding but at filesizes. TP's properties for the file I had opened said ~50k, but the actual disk file was ~100k. Then I noticed it said the encoding was Unicode, and a hex dump of the file showed it to be 2-byte Unicode throughout.

Hence my conclusion - TP computes the filesize by counting characters, and doesn't consider how many bytes each character comprises.
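The mismatch is easy to reproduce outside TextPad. The following Python sketch assumes a file of 50,000 ASCII characters saved as UTF-16 little endian with a byte order mark (an assumed layout, chosen to mirror the ~50k versus ~100k figures above), and compares the character count with the size the file system reports.

```python
import os
import tempfile

text = "x" * 50_000  # 50,000 characters, all plain ASCII

# Write the text as UTF-16 little endian preceded by a byte order mark.
# This layout is an assumption about what a typical "Unicode" file looks like.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\xff\xfe" + text.encode("utf-16-le"))
    path = handle.name

size_on_disk = os.path.getsize(path)  # bytes, as the file system sees them
char_count = len(text)                # what a character counter would report
os.remove(path)

print(char_count, size_on_disk)
```

The size on disk is twice the character count plus 2 bytes for the byte order mark, so a dialog that reports the character count as the file size would show roughly half the real figure.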

Anyway, I'll use the feedback page - thanks for pointing that out.
Rick Jones
MudGuard
Posts: 1295
Joined: Sun Mar 02, 2003 10:15 pm
Location: Munich, Germany

Post by MudGuard »

Go to Help - Help Topics - Content - How To - Work with Files - Unicode Files
and read this:
TextPad automatically detects 16-bit Unicode and UTF-8 encoded characters, when opening files. Unicode characters may be in "little endian" (Intel) or "big endian" (RISC) order, and the order is preserved when a file is saved.

Internally, these files are converted to single or double byte characters (DBCS), using the locale corresponding to the font script selected for the document class. For example, if the screen font for the Text document class is MS Mincho, with the script set to Japanese, Unicode characters in *.TXT files will be converted to the corresponding DBCS characters in code page 932.

WARNING: This means that it is only possible to edit, without data loss, files containing characters from the implied code page. Other characters will be converted into a system default character (normally "?"), if you confirm that is what you want to do.

Conclusion: TextPad is able to read some UTF-16 files, but it will convert them to its own system...
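
The behaviour described in the help text can be imitated in a few lines of Python. This is only a sketch under stated assumptions, not TextPad's actual implementation: the function names `decode_unicode_file` and `to_code_page` are hypothetical, detection relies solely on a byte order mark, and code page 932 is taken from the help text's example.

```python
import codecs

def decode_unicode_file(data):
    """Decode raw bytes, choosing the encoding from a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # Assumption: with no byte order mark, treat the data as UTF-8.
    return data.decode("utf-8")

def to_code_page(text, code_page="cp932"):
    """Convert to a single/double byte code page; unmappable characters become '?'."""
    return text.encode(code_page, errors="replace")

# "abc", three Japanese characters, and one Korean character (not in cp932).
original = "abc\u65e5\u672c\u8a9e\ud55c"
raw = codecs.BOM_UTF16_LE + original.encode("utf-16-le")

internal = to_code_page(decode_unicode_file(raw))
round_trip = internal.decode("cp932")

print(len(raw), len(internal), round_trip)
```

The ASCII letters end up as 1 byte each and the Japanese characters as 2 bytes each, while the Korean character falls outside code page 932 and comes back as "?", which is the data loss the WARNING refers to.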