If a single character is selected, add a function to display the hex value of the character.
Add that function to the context menu when right-clicking the selection.
This would help to discover which "funny" character has been added to a UTF-8 encoded text file.
Extend to display a pop-up hex view of a larger selection.
Character code viewer
Re: Character code viewer
Those are 2 (if not 3) different things:
- The "hex value" of a character depends on the encoding - in UTF-8 a character can be encoded in 1, 2, 3 or 4 bytes, giving that many hex values.
- Different text encodings have different bytes/hex values - that's the whole point of text encodings. One character encoded in UTF-8 might need 3 bytes, but in UTF-16 only 2; the hex value/byte of a character encoded in Windows-1252 might be totally different in DOS-852.
- The code point of a character is probably what you mean by "character code" and "hex value" - but that is not bound to any encoding. However, Unicode is not the only code point table and may not always be what a user wants to know.
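To make the three points above concrete, here is a minimal Python sketch. The helper name `encoded_hex` and the sample characters are my own choices for illustration; the codec names are the ones Python ships with (`cp852` standing in for DOS-852).

```python
def encoded_hex(ch: str, encoding: str) -> str:
    """Return the bytes of one character in the given encoding as hex."""
    return ch.encode(encoding).hex(" ").upper()

euro = "\u20ac"         # EURO SIGN, code point U+20AC
e_acute = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, U+00E9
emoji = "\U0001F600"    # a code point beyond U+FFFF

# Same code point, different byte counts per encoding.
print(encoded_hex(euro, "utf-8"))       # 3 bytes
print(encoded_hex(euro, "utf-16-be"))   # 2 bytes

# Same character, different byte values per code page.
print(encoded_hex(e_acute, "cp1252"))
print(encoded_hex(e_acute, "cp852"))

# A character that needs 4 bytes in UTF-8 also needs 4 in UTF-16.
print(encoded_hex(emoji, "utf-8"))
print(encoded_hex(emoji, "utf-16-be"))
```

The code point itself (`ord(ch)`) stays the same in every case; only the stored bytes change.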
Re: Character code viewer
I think it safe to assume that the developers will have understood the point of the request.
Re: Character code viewer
From my experience as both user and programmer I know how ambiguously requests can be both made and understood - both sides think they hit the nail on the head and don't consider what else it could mean (simply because they don't know which other things exist). That's why I listed every possible way.
I (as both user and programmer) would want/provide both at once: displaying the code point (as per codepage) and the bytes - so for the grapheme Ø the information to be displayed should be
LATIN CAPITAL LETTER O WITH STROKE
code point U+00D8 in UTF-8
encoded as bytes "0xC3 0x89"
(The first line would be a bonus to identify the grapheme even more - the reference list can be found at https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt.)
Re: Character code viewer
The UTF-8 encoding of Unicode code point U+00D8 is 0xC3 0x98, not 0xC3 0x89. (The code point and the last byte must end with the same hex digit.)
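For anyone who wants to check the arithmetic, here is a small Python sketch of the two-byte UTF-8 form (110xxxxx 10xxxxxx). The function name `utf8_two_byte` is made up for this post; the result is compared against Python's own encoder.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF by hand as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only U+0080..U+07FF use the two-byte form")
    lead = 0xC0 | (code_point >> 6)      # 110xxxxx: the upper 5 bits
    trail = 0x80 | (code_point & 0x3F)   # 10xxxxxx: the lower 6 bits
    return bytes([lead, trail])

encoded = utf8_two_byte(0x00D8)
print(encoded.hex(" ").upper())

# The hand-built bytes match the built-in encoder.
print(encoded == "\u00d8".encode("utf-8"))

# The trailing byte carries the low 6 bits unchanged, so its low hex
# digit always equals the low hex digit of the code point.
print(encoded[-1] & 0x0F == 0x00D8 & 0x0F)
```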
The layout above suggests that the value of a Unicode code point depends on the encoding used to store or transmit it. This might avoid confusion:
Unicode code point U+00D8
encoded in UTF-8 as 0xC3 0x98
Also:
encoded in UTF-16 as 0x00D8
encoded in UTF-16LE (as in Windows) as 0xD8 0x00
encoded in UTF-16BE (as (usually) on the net) as 0x00 0xD8
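A rough Python sketch of how such a display could be generated, purely as an illustration and not a claim about how TextPad would do it. `describe_char` is a hypothetical helper; the character name comes from Python's bundled `unicodedata` module rather than from UnicodeData.txt directly.

```python
import unicodedata

def describe_char(ch: str) -> list[str]:
    """Return display lines for one selected character (hypothetical helper)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")

    def hex_bytes(encoding: str) -> str:
        return " ".join(f"0x{b:02X}" for b in ch.encode(encoding))

    return [
        unicodedata.name(ch, "(no name in this Unicode database)"),
        f"Unicode code point U+{ord(ch):04X}",
        f"encoded in UTF-8 as {hex_bytes('utf-8')}",
        f"encoded in UTF-16LE as {hex_bytes('utf-16-le')}",
        f"encoded in UTF-16BE as {hex_bytes('utf-16-be')}",
    ]

for line in describe_char("\u00d8"):
    print(line)
```

The code point line is printed once and every encoding gets its own line, so the code point never appears to depend on the encoding.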
Re: Character code viewer
How would that look if Windows-1252 is used? Then you can't use "Unicode" anymore. Or would you?
My suggestion was on purpose, because now it could look this way:
code point D8 in Windows-1252
encoded as byte "0xD8"
IBM 850 would then be:
code point 9D in IBM 850
encoded as byte "0x9D"
But I'm not that sure myself about how it should be displayed when both code point and encoded byte(s) don't differ.
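The two values above can be checked with Python's built-in code page codecs. This is only a sketch: `code_page_byte` is a name I made up, and `cp1252` / `cp850` are Python's codec names for Windows-1252 and IBM 850.

```python
def code_page_byte(ch: str, code_page: str) -> int:
    """Return the single byte that represents ch in a one-byte code page."""
    raw = ch.encode(code_page)  # raises UnicodeEncodeError if ch is absent
    if len(raw) != 1:
        raise ValueError(f"{code_page} is not a single-byte code page for {ch!r}")
    return raw[0]

o_stroke = "\u00d8"  # the grapheme from the example above
for code_page in ("cp1252", "cp850"):
    value = code_page_byte(o_stroke, code_page)
    print(f"{code_page}: 0x{value:02X}")

# Going the other way: the same byte maps to different characters
# depending on which code page is assumed.
print(bytes([0x9D]).decode("cp850") == o_stroke)
```

In a single-byte code page the position in the table and the stored byte are the same number, which is why the two lines of the display would not differ.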
Re: Character code viewer
My comment was primarily about this:
code point U+00D8 in UTF-8
Did you mean Unicode?
UTF-8 is not a character set.
And 0x00D8 on its own isn't a UTF-8 encoding of any code point, Unicode or otherwise. It could be the first of a pair of bytes, the second of which must be in the range 0x80..0xBF, encoding the code points U+0600..U+063F.
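The claim about the byte 0xD8 can be verified with a few lines of Python. The helper names `decodes_alone` and `two_byte_range` are made up for this sketch; it relies only on the built-in strict UTF-8 decoder.

```python
LEAD = 0xD8  # the byte in question, treated as a single byte value

def decodes_alone(byte: int) -> bool:
    """True if the single byte is a complete UTF-8 sequence."""
    try:
        bytes([byte]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def two_byte_range(lead: int) -> tuple[int, int]:
    """Lowest and highest code point reachable with this lead byte
    followed by one continuation byte in 0x80..0xBF."""
    code_points = [
        ord(bytes([lead, trail]).decode("utf-8"))
        for trail in range(0x80, 0xC0)
    ]
    return min(code_points), max(code_points)

low, high = two_byte_range(LEAD)
print(decodes_alone(LEAD))
print(f"U+{low:04X}..U+{high:04X}")
```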
Re: Character code viewer
The U in "UTF-8" already implies Unicode, that's why I didn't include the name/specification "Unicode" on its own, as it would have been repetitive.
Nor did I write that UTF-8 is a character set - I just combined the Unicode table with the hypothetical text encoding of the current file, as in: "the value of the selected grapheme is U+00D8 for the currently used Unicode Transformation Format 8-bit interpretation".
ben_josephs wrote: Wed Jun 26, 2024 10:02 am
And 0x00D8 on its own isn't a UTF-8 encoding of any code point, Unicode or otherwise. It could be the first of a pair of bytes, the second of which must be in the range 0x80..0xBF, encoding the code points U+0600..U+063F.
That's all correct, but I never wrote "0x00D8" (a word value) and never will - I clearly used "U+00D8" for its code point (again: the U is for Unicode). Not sure why you insisted on writing all this - believe it or not, I have long experience with Unicode, from even before TextPad could handle it correctly, and just made the one typo which you already figured out. Please check whether you have misread my texts and are now making false assumptions about me not being able to distinguish standards, code points, encodings, graphemes, bytes and endianness.
If you're that fond of all this then it should be rather easy to let TextPad also support CESU-8, MUTF-8 and WTF-8 - that would be a good addition for what are hopefully rarer cases, and implementation-wise it should be rather trivial.
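To give an idea of what one of those variants involves, here is a minimal Python sketch of a CESU-8 encoder, assuming the usual definition: characters above U+FFFF are first split into a UTF-16 surrogate pair and each surrogate is then written as a three-byte sequence. Python has no built-in CESU-8 codec as far as I know, so `cesu8_encode` is a hand-rolled, hypothetical helper; MUTF-8 and WTF-8 are not covered here.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (sketch; hypothetical helper, encode only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair first.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets the UTF-8 codec emit the
                # three-byte form for a surrogate code point.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)

print(cesu8_encode("\u00d8").hex(" ").upper())       # same as plain UTF-8
print(cesu8_encode("\U0001F600").hex(" ").upper())   # 6 bytes instead of 4
```

A viewer that supports this would have to label such a six-byte sequence as one code point, which is where the display questions from earlier in the thread come back.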