Character code viewer

Post by **IanOfYork** » Fri Jun 21, 2024 8:27 pm

If a single character is selected, add a function to display the hex value of the character.
Add that function to the context menu when right-clicking the selection.
This would help to discover which "funny" character has been added to a UTF-8 encoded text file.

Extend to display a pop-up hex view of a larger selection.

Post by **AmigoJack** » Sat Jun 22, 2024 7:52 pm

IanOfYork wrote: ↑Fri Jun 21, 2024 8:27 pm
the hex value of the character
...
a UTF-8 encoded text

Those are 2 (if not 3) different things:

The "hex value" of a character depends on the encoding - in UTF-8 a character can be encoded in 1, 2, 3 or 4 bytes, and those bytes are the hex values.
Different text encodings produce different bytes/hex values - that's the whole point of text encodings. One character encoded in UTF-8 might need 3 bytes, but in UTF-16 only 2; the byte for a character in Windows-1252 might be totally different from its byte in DOS-852.
The code point of a character is probably what you mean by "character code" and "hex value" - but that is independent of any encoding. However, Unicode is not the only code point table and may not always be what a user wants to know.

(Since far from everything is a "character" - think of punctuation or spaces - the Unicode consortium uses the more generic term grapheme.)
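
As a rough illustration of the distinction above, here is a minimal Python sketch that prints one code point next to its bytes in several encodings. The helper name `encoded_hex` and the choice of sample characters are assumptions made for this example only; nothing here reflects how TextPad works internally.

```python
# Illustrative sketch: one code point, different bytes per encoding.
# The helper name and the sample characters are assumptions for this example.

def encoded_hex(ch: str, encoding: str) -> str:
    """Return the bytes of ch in the given encoding as space-separated hex."""
    return " ".join(f"0x{b:02X}" for b in ch.encode(encoding))


samples = [
    ("€", ["utf-8", "utf-16-be"]),  # 3 bytes in UTF-8, 2 bytes in UTF-16
    ("é", ["cp1252", "cp852"]),     # a different single byte per code page
]

for ch, encodings in samples:
    print(f"{ch}  code point U+{ord(ch):04X}")
    for enc in encodings:
        print(f"   {enc}: {encoded_hex(ch, enc)}")
```

The code point stays the same in every row; only the encoded bytes change.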

Post by **IanOfYork** » Mon Jun 24, 2024 2:26 pm

I think it is safe to assume that the developers will have understood the point of the request.

Post by **AmigoJack** » Mon Jun 24, 2024 7:06 pm

From my experience as both user and programmer I know how ambiguously requests can be both made and understood - both sides think they hit the nail on the head and don't consider what else it could mean (simply because they don't know which other things exist). That's why I listed every possible interpretation.

I (as both user and programmer) would want/provide both at once: displaying the code point (as per codepage) and the bytes - so for the grapheme Ø the information to be displayed should be

LATIN CAPITAL LETTER O WITH STROKE
code point U+00D8 in UTF-8
encoded as bytes "0xC3 0x89"

(The first line would be a bonus to identify the grapheme even more precisely - the reference list can be found at https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt.)

Post by **ben_josephs** » Tue Jun 25, 2024 1:06 pm

AmigoJack wrote: ↑Mon Jun 24, 2024 7:06 pm
code point U+00D8 in UTF-8
encoded as bytes "0xC3 0x89"

The UTF-8 encoding of Unicode code point U+00D8 is 0xC3 0x98, not 0xC3 0x89. (They must end with the same hex digit.)
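
The "same hex digit" remark follows from the 2-byte UTF-8 bit layout (110xxxxx 10xxxxxx): the low 6 bits of the code point are copied unchanged into the second byte, so the low hex digit survives. A minimal hand-rolled sketch in Python, where the function name `utf8_two_byte` is made up for this example and only the 2-byte range is handled:

```python
# Hand-rolled 2-byte UTF-8 encoding, for illustration only.
# The function name is an assumption; real code would call str.encode("utf-8").

def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the 2-byte range U+0080..U+07FF is handled")
    lead = 0b1100_0000 | (code_point >> 6)          # 110xxxxx: upper 5 bits
    trail = 0b1000_0000 | (code_point & 0b11_1111)  # 10xxxxxx: lower 6 bits
    return bytes([lead, trail])


encoded = utf8_two_byte(0x00D8)
print(" ".join(f"0x{b:02X}" for b in encoded))

# The low hex digit of the code point equals the low hex digit of the last byte.
print((0x00D8 & 0xF) == (encoded[-1] & 0xF))
```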

The layout above suggests that the value of a Unicode code point depends on the encoding used to store or transmit it. This might avoid confusion:

Unicode code point U+00D8
encoded in UTF-8 as 0xC3 0x98

Also:

encoded in UTF-16 as 0x00D8
encoded in UTF-16LE (as in Windows) as 0xD8 0x00
encoded in UTF-16BE (as (usually) on the net) as 0x00 0xD8
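
For anyone who wants to reproduce this layout, a minimal Python sketch follows. The function name `describe` and the exact wording of the output lines are assumptions for illustration; the character name comes from Python's bundled copy of the Unicode Character Database.

```python
# Sketch of the proposed display: name, code point, then bytes per encoding.
# The function name and output wording are assumptions, not TextPad behaviour.
import unicodedata


def describe(ch: str) -> list[str]:
    """Return display lines for a single character."""
    lines = [
        unicodedata.name(ch, "<unnamed>"),
        f"Unicode code point U+{ord(ch):04X}",
    ]
    for label, codec in [
        ("UTF-8", "utf-8"),
        ("UTF-16LE", "utf-16-le"),
        ("UTF-16BE", "utf-16-be"),
    ]:
        hex_bytes = " ".join(f"0x{b:02X}" for b in ch.encode(codec))
        lines.append(f"encoded in {label} as {hex_bytes}")
    return lines


for line in describe("Ø"):
    print(line)
```

The explicit `-le` and `-be` codecs are used so that no byte order mark is prepended to the output.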

Post by **AmigoJack** » Tue Jun 25, 2024 3:05 pm

ben_josephs wrote: ↑Tue Jun 25, 2024 1:06 pm
Unicode code point U+00D8
encoded in UTF-8 as 0xC3 0x98

How would that look if Windows-1252 is used? Then you can't use "Unicode" anymore. Or would you?

My suggestion was worded that way on purpose, because now it could look this way:

code point D8 in Windows-1252
encoded as byte "0xD8"

IBM 850 would then be:

code point 9D in IBM 850
encoded as byte "0x9D"

But I'm not that sure myself about how it should be displayed when the code point and the encoded byte(s) are the same.
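
The two byte values above can be checked with a short Python sketch. It assumes Python's `cp1252` and `cp850` codecs match the Windows-1252 and IBM 850 tables; the helper name `single_byte` is made up for this example.

```python
# Sketch: the byte for "Ø" in two single-byte code pages.
# Assumes Python's cp1252 / cp850 codecs correspond to Windows-1252 / IBM 850.

def single_byte(ch: str, codec: str) -> int:
    """Return the one byte that encodes ch in a single-byte code page."""
    encoded = ch.encode(codec)  # raises UnicodeEncodeError if ch is not in the table
    if len(encoded) != 1:
        raise ValueError(f"{codec} is not a single-byte encoding for {ch!r}")
    return encoded[0]


for codec in ("cp1252", "cp850"):
    print(f"{codec}: 0x{single_byte('Ø', codec):02X}")

# In Windows-1252 the byte happens to equal the Unicode code point for this character.
print(single_byte("Ø", "cp1252") == ord("Ø"))
```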

Post by **ben_josephs** » Wed Jun 26, 2024 10:02 am

My comment was primarily about this:

code point U+00D8 in UTF-8

Did you mean Unicode?
UTF-8 is not a character set.
And 0x00D8 on its own isn't a UTF-8 encoding of any code point, Unicode or otherwise. It could be the first of a pair of bytes, the second of which must be in the range 0x80..0xBF, encoding the code points U+0600..U+063F.
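
This can be checked with Python's strict UTF-8 decoder. A minimal sketch, where the helper name `try_utf8` is an assumption for this example:

```python
# Sketch: a lone 0xD8 byte is not valid UTF-8, but 0xD8 followed by a
# continuation byte in 0x80..0xBF decodes to a code point in U+0600..U+063F.

def try_utf8(data: bytes) -> str:
    """Decode data as strict UTF-8 and report code points, or the failure."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"invalid: {exc.reason}"
    return " ".join(f"U+{ord(c):04X}" for c in text)


for data in (b"\xD8", b"\xD8\x80", b"\xD8\xBF", b"\xD8\x41"):
    shown = " ".join(f"0x{b:02X}" for b in data)
    print(f"{shown} -> {try_utf8(data)}")
```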

Post by **AmigoJack** » Wed Jun 26, 2024 11:29 am

ben_josephs wrote: ↑Wed Jun 26, 2024 10:02 am
code point U+00D8 in UTF-8
Did you mean Unicode?

The U in "UTF-8" already implies Unicode, that's why I didn't include the name/specification "Unicode" on its own, as it would have been repetitive.

ben_josephs wrote: ↑Wed Jun 26, 2024 10:02 am
UTF-8 is not a character set.

I didn't write that either - I just combined the Unicode table with the hypothetical text encoding of the current file, as in: "the value of the selected grapheme is U+00D8 for the currently used Unicode Transformation Format 8-bit interpretation".

ben_josephs wrote: ↑Wed Jun 26, 2024 10:02 am
And 0x00D8 on its own isn't a UTF-8 encoding of any code point, Unicode or otherwise. It could be the first of a pair of bytes, the second of which must be in the range 0x80..0xBF, encoding the code points U+0600..U+063F.

That's all correct, but I never wrote "0x00D8" (a word value) and never will - I clearly used "U+00D8" for its code point (again: the U is for Unicode). Not sure why you insisted on writing all this - believe it or not, I have had long experience with Unicode, even from before TextPad could handle it correctly, and just made the one typo which you already figured out. Please check whether you have misread my posts and are now making false assumptions about me not being able to distinguish standards, code points, encodings, graphemes, bytes and endianness.

If you're that fond of all this, then it should be rather easy to let TextPad also support CESU-8, MUTF-8 and WTF-8 - that would be a good addition for (hopefully) rarer cases, and implementation-wise it should be rather trivial.
