Bug - file size mis-reported for Unicode

Post by **Rick Jones** » Wed Jun 25, 2003 9:00 am

Is this the right place to report a specific bug? I couldn't see a "bugs" section.

Anyway, I've just noticed that the file properties dialog shows the wrong file size for Unicode files. It's exactly half the actual size. I.e. TextPad is counting the characters correctly, but not allowing for the fact that Unicode characters are 2 bytes, not 1.

So the character count shows correctly, but the filesize should give the size in bytes, which it doesn't.

Spotted it in 4.6, still there in 4.6.2.

Post by **MudGuard** » Wed Jun 25, 2003 9:46 am

How many bytes a Unicode character requires depends on the encoding.

As far as I know, TextPad does NOT support UTF-16 (which would indeed have 2 bytes for most characters).

As far as I know, TextPad does have (limited) support for UTF-8, which uses 1 byte for each character of the (7-bit) ASCII set and 2 or more bytes for all other characters.
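To make the point about encodings concrete, here is a minimal sketch in Python (not anything TextPad itself runs). The helper name `encoded_sizes` and the sample strings are made up for illustration; `utf-16-le` is used so that no byte order mark is counted.

```python
# Compare character counts with byte counts under two Unicode encodings.
samples = ["hello", "h\u00e9llo", "\u65e5\u672c\u8a9e"]  # ASCII, Latin-1 accent, Japanese


def encoded_sizes(text):
    """Return (characters, UTF-8 bytes, UTF-16 bytes without a BOM)."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )


for s in samples:
    chars, utf8, utf16 = encoded_sizes(s)
    print(f"{s!r}: {chars} chars, {utf8} bytes as UTF-8, {utf16} bytes as UTF-16")
```

For pure ASCII text the UTF-8 size equals the character count, while the UTF-16 size is twice the character count, which is the factor of two described in the original report.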

And for bug reporting: the feedback form under "Support" - "Feedback" would be the right place - but as I see it this is not a bug but a misunderstanding on your side concerning Unicode and Unicode encodings.

Post by **Rick Jones** » Wed Jun 25, 2003 11:13 am

I beg to differ, TP does support UTF-16.

I noticed the problem not because I was looking at encoding but at filesizes. TP's properties for the file I had opened said ~50k, but the actual disk file was ~100k. Then I noticed it said the encoding was Unicode, and a hex dump of the file showed it to be 2-byte Unicode throughout.

Hence my conclusion - TP computes the filesize by counting characters, and doesn't consider how many bytes each character comprises.
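The discrepancy is easy to reproduce outside the editor. The following is only a sketch in Python under assumed conditions: the function name `char_count_vs_file_size` is hypothetical, and the file is written as UTF-16 with a byte order mark, which may or may not match the file in question.

```python
import os
import tempfile


def char_count_vs_file_size(text, encoding="utf-16"):
    """Write text to a temporary file and return (characters, bytes on disk)."""
    with tempfile.NamedTemporaryFile(
        "w", encoding=encoding, newline="", suffix=".txt", delete=False
    ) as f:
        f.write(text)
        path = f.name
    try:
        # The file is closed here, so the size on disk is final.
        return len(text), os.path.getsize(path)
    finally:
        os.remove(path)


chars, size = char_count_vs_file_size("x" * 50_000)
print(f"{chars} characters, {size} bytes on disk")
```

With 50,000 ASCII characters the file on disk is about 100k (two bytes per character plus a two-byte BOM), so a tool that reports the character count as the file size would show roughly half the real figure.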

Anyway, I'll use the feedback page - thanks for pointing that out.

Post by **MudGuard** » Wed Jun 25, 2003 12:51 pm

Go to Help - Help Topics - Content - How To - Work with Files - Unicode Files
and read this:

TextPad automatically detects 16-bit Unicode and UTF-8 encoded characters, when opening files. Unicode characters may be in "little endian" (Intel) or "big endian" (RISC) order, and the order is preserved when a file is saved.
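One common way to tell these encodings apart is to look for a byte order mark (BOM) at the start of the file. The sketch below is in Python and is an assumption about how such detection can work in general, not a description of TextPad's actual logic; the function name `detect_bom` is hypothetical.

```python
import codecs


def detect_bom(data: bytes) -> str:
    """Guess the encoding of raw bytes from a leading byte order mark.

    UTF-32 is deliberately ignored here, so a UTF-32 LE file would be
    misreported as UTF-16 LE.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"  # "little endian" (Intel) order
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"  # "big endian" order
    return "unknown"


little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(detect_bom(little), detect_bom(big))
```

Files without a BOM need heuristics instead, which this sketch does not attempt.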

Internally, these files are converted to single or double byte characters (DBCS), using the locale corresponding to the font script selected for the document class. For example, if the screen font for the Text document class is MS Mincho, with the script set to Japanese, Unicode characters in *.TXT files will be converted to the corresponding DBCS characters in code page 932.

WARNING: This means that it is only possible to edit, without data loss, files containing characters from the implied code page. Other characters will be converted into a system default character (normally "?"), if you confirm that is what you want to do.
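The data loss described in the warning can be illustrated with Python's `cp932` codec. This is a sketch of the general effect of converting Unicode text to a single code page, not TextPad's own conversion routine; the function name `to_code_page` and the sample strings are assumptions.

```python
def to_code_page(text, code_page="cp932"):
    """Encode text into a legacy code page, replacing unmappable characters with '?'."""
    return text.encode(code_page, errors="replace")


japanese = "\u65e5\u672c\u8a9e"  # Japanese text, representable in code page 932
korean = "\ud55c\uad6d\uc5b4"    # Hangul, not representable in code page 932

# Characters from the implied code page survive a round trip unchanged.
print(to_code_page(japanese).decode("cp932") == japanese)

# Characters outside the code page are replaced and cannot be recovered.
print(to_code_page(korean))
```

Once the substitution has happened, the original characters are gone from the converted data, which is why editing is only lossless for files that stay within the implied code page.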

Conclusion: TextPad is able to read some UTF-16 files, but it will convert them to its own system...