Character code viewer

Ideas for new features

IanOfYork
Posts: 107
Joined: Sat Nov 04, 2017 11:54 am
Location: York, England

Character code viewer

Post by IanOfYork »

If a single character is selected, add a function to display the hex value of the character.
Add that function to the context menu when right-clicking the selection.
This would help to discover which "funny" character has been added to a UTF-8 encoded text file.

Extend to display a pop-up hex view of a larger selection.
AmigoJack
Posts: 500
Joined: Sun Oct 30, 2016 4:28 pm
Location: グリーン ヒル ゾーン

Re: Character code viewer

Post by AmigoJack »

IanOfYork wrote: Fri Jun 21, 2024 8:27 pm
the hex value of the character
...
a UTF-8 encoded text
Those are 2 (if not 3) different things:
  1. The "hex value" of a character depends on the encoding - in UTF-8 a character can be encoded in 1, 2, 3 or 4 bytes, and those bytes are the hex values.
  2. Different text encodings produce different bytes/hex values - that's the whole point of text encodings. One character encoded in UTF-8 might need 3 bytes, but in UTF-16 only 2; the hex value/byte of a character encoded in Windows-1252 might be totally different in DOS-852.
  3. The code point of a character is probably what you mean by "character code" and "hex value" - but that is not bound to any encoding. However, Unicode is not the only code point table and may not always be what a user wants to know.
(Since far from everything is a "character" - think of punctuation or spaces - the Unicode consortium uses the more generic term grapheme.)
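To make the difference between a code point and its encoded bytes concrete, here is a minimal Python sketch. Python and the helper name `encoded_lengths` are assumptions chosen purely for illustration; nothing here reflects how TextPad works internally. It prints how many bytes a few sample characters occupy in UTF-8 and in UTF-16.

```python
def encoded_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# Sample characters: ASCII letter, Latin-1 letter, euro sign, an emoji.
for ch in ["A", "\u00D8", "\u20AC", "\U0001F600"]:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, {utf16_len} byte(s) in UTF-16")
```

The code point stays the same in every row; only the byte counts change with the encoding.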
IanOfYork
Posts: 107
Joined: Sat Nov 04, 2017 11:54 am
Location: York, England

Re: Character code viewer

Post by IanOfYork »

I think it safe to assume that the developers will have understood the point of the request.
AmigoJack
Posts: 500
Joined: Sun Oct 30, 2016 4:28 pm
Location: グリーン ヒル ゾーン

Re: Character code viewer

Post by AmigoJack »

From my experience as both user and programmer I know how ambiguously requests can be made and understood - both sides think they have hit the nail on the head and don't consider what else could be meant (simply because they don't know which other things exist). That's why I listed every possible interpretation.

I (as both user and programmer) would want/provide both at once: displaying the code point (as per codepage) and the bytes - so for the grapheme Ø the information to be displayed should be
LATIN CAPITAL LETTER O WITH STROKE
code point U+00D8 in UTF-8
encoded as bytes "0xC3 0x89"
(The first line would be a bonus to identify the grapheme even more - the reference list can be found at https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt.)
ben_josephs
Posts: 2459
Joined: Sun Mar 02, 2003 9:22 pm

Re: Character code viewer

Post by ben_josephs »

AmigoJack wrote: Mon Jun 24, 2024 7:06 pm code point U+00D8 in UTF-8
encoded as bytes "0xC3 0x89"
The UTF-8 encoding of Unicode code point U+00D8 is 0xC3 0x98, not 0xC3 0x89. (They must end with the same hex digit.)

The layout above suggests that the value of a Unicode code point depends on the encoding used to store or transmit it. This might avoid confusion:
Unicode code point U+00D8
encoded in UTF-8 as 0xC3 0x98
Also:
encoded in UTF-16 as 0x00D8
encoded in UTF-16LE (as in Windows) as 0xD8 0x00
encoded in UTF-16BE (as (usually) on the net) as 0x00 0xD8
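The layout proposed here could be generated along these lines. This is only a sketch in Python using the standard `unicodedata` module; the function names `hex_bytes` and `character_info` are made up for this example and say nothing about how the editor itself would implement the feature.

```python
import unicodedata


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated 0xNN values."""
    return " ".join(f"0x{b:02X}" for b in data)


def character_info(ch: str) -> list[str]:
    """Describe one character: name, code point, and its encoded bytes."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return [
        unicodedata.name(ch, "<no name>"),
        f"Unicode code point U+{ord(ch):04X}",
        f"encoded in UTF-8 as {hex_bytes(ch.encode('utf-8'))}",
        f"encoded in UTF-16LE as {hex_bytes(ch.encode('utf-16-le'))}",
        f"encoded in UTF-16BE as {hex_bytes(ch.encode('utf-16-be'))}",
    ]


print("\n".join(character_info("\u00D8")))
```

The code point line never mentions an encoding, which matches the point made above: the code point is a property of the character, the bytes are a property of the encoding.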
AmigoJack
Posts: 500
Joined: Sun Oct 30, 2016 4:28 pm
Location: グリーン ヒル ゾーン

Re: Character code viewer

Post by AmigoJack »

ben_josephs wrote: Tue Jun 25, 2024 1:06 pm
Unicode code point U+00D8
encoded in UTF-8 as 0xC3 0x98
How would that look if Windows-1252 is used? Then you can't use "Unicode" anymore. Or would you?

My suggestion was on purpose, because now it could look this way:
code point D8 in Windows-1252
encoded as byte "0xD8"
IBM 850 would then be:
code point 9D in IBM 850
encoded as byte "0x9D"
But I'm not that sure myself about how it should be displayed when both code point and encoded byte(s) don't differ.
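For single-byte code pages, the byte values quoted above can be checked with a short Python sketch. The codec names `cp1252` and `cp850` are Python's names for Windows-1252 and IBM 850; the helper `legacy_byte` is hypothetical and only serves this example.

```python
from typing import Optional


def legacy_byte(ch: str, codec: str) -> Optional[str]:
    """Return the character's byte(s) in a legacy code page, or None if unmapped."""
    try:
        data = ch.encode(codec)
    except UnicodeEncodeError:
        # The code page has no slot for this character.
        return None
    return " ".join(f"0x{b:02X}" for b in data)


for codec in ("cp1252", "cp850"):
    print(codec, legacy_byte("\u00D8", codec))

# The euro sign exists in Windows-1252 but not in IBM 850.
print("cp850", legacy_byte("\u20AC", "cp850"))
```

A viewer would also need to decide what to show for characters the file's code page cannot represent, which is why the sketch returns `None` in that case.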
ben_josephs
Posts: 2459
Joined: Sun Mar 02, 2003 9:22 pm

Re: Character code viewer

Post by ben_josephs »

My comment was primarily about this:
code point U+00D8 in UTF-8
Did you mean Unicode?
UTF-8 is not a character set.
And 0x00D8 on its own isn't a UTF-8 encoding of any code point, Unicode or otherwise. It could be the first of a pair of bytes, the second of which must be in the range 0x80..0xBF, encoding the code points U+0600..U+063F.
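The range quoted above follows from the UTF-8 bit layout for two-byte sequences (110xxxxx 10xxxxxx). As a rough check, here is a Python sketch; the function name `two_byte_lead_range` is invented for this illustration.

```python
def two_byte_lead_range(lead: int) -> tuple[int, int]:
    """Return the (lowest, highest) code point a two-byte UTF-8 lead byte can start."""
    if not 0xC2 <= lead <= 0xDF:
        raise ValueError("not a valid two-byte UTF-8 lead byte")
    # The lead byte carries the upper 5 bits, the continuation byte the lower 6.
    high = (lead & 0x1F) << 6
    return high, high | 0x3F


low, high = two_byte_lead_range(0xD8)
print(f"U+{low:04X}..U+{high:04X}")

# The byte 0xD8 alone is not valid UTF-8; with a continuation byte it is.
try:
    bytes([0xD8]).decode("utf-8")
except UnicodeDecodeError:
    print("0xD8 alone cannot be decoded")
print(f"U+{ord(bytes([0xD8, 0x80]).decode('utf-8')):04X}")
```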
AmigoJack
Posts: 500
Joined: Sun Oct 30, 2016 4:28 pm
Location: グリーン ヒル ゾーン

Re: Character code viewer

Post by AmigoJack »

ben_josephs wrote: Wed Jun 26, 2024 10:02 am
code point U+00D8 in UTF-8
Did you mean Unicode?
The U in "UTF-8" already implies Unicode, that's why I didn't include the name/specification "Unicode" on its own, as it would have been repetitive.
ben_josephs wrote: Wed Jun 26, 2024 10:02 am
UTF-8 is not a character set.
I didn't write that either - I just combined the Unicode table with the hypothetical text encoding of the current file, as in: "the value of the selected grapheme is U+00D8 for the currently used Unicode Transformation Format 8-bit interpretation".
ben_josephs wrote: Wed Jun 26, 2024 10:02 am
And 0x00D8 on its own isn't a UTF-8 encoding of any code point, Unicode or otherwise. It could be the first of a pair of bytes, the second of which must be in the range 0x80..0xBF, encoding the code points U+0600..U+063F.
That's all correct, but I never wrote "0x00D8" (a word value) and never will - I clearly used "U+00D8" for the code point (again: the U is for Unicode). I'm not sure why you insisted on writing all this - believe it or not, I had long experience with Unicode even before TextPad could handle it correctly, and I just made the one typo, which you already spotted. Please check whether you have misread my texts and are now making false assumptions about me not being able to distinguish standards, code points, encodings, graphemes, bytes and endianness.

If you're that fond of all this then it should be rather easy to let TextPad also support CESU-8, MUTF-8 and WTF-8 - that would be a good addition for what are hopefully rarer cases, and implementation-wise it should be rather trivial.
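To give a feel for the size of such a change, here is a rough Python sketch of a CESU-8 encoder. It assumes the usual definition of CESU-8 (supplementary characters are first split into a UTF-16 surrogate pair, and each surrogate is then written as a three-byte sequence); `cesu8_encode` is a hypothetical helper, not an existing Python codec, and MUTF-8 and WTF-8 are not covered.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (sketch; lone surrogates in the input raise an error)."""
    out = bytearray()
    for ch in text:
        if ord(ch) >= 0x10000:
            # Split into the two UTF-16 code units (a surrogate pair).
            units = ch.encode("utf-16-be")
            for i in (0, 2):
                unit = int.from_bytes(units[i:i + 2], "big")
                # "surrogatepass" lets each surrogate be written as 3 bytes.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


print(cesu8_encode("\U0001F600").hex(" ").upper())
print(cesu8_encode("\u00D8").hex(" ").upper())
```

Characters in the Basic Multilingual Plane come out identical to UTF-8; only supplementary characters differ (6 bytes instead of 4).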