filename extensions in the document classes

filename extensions in the document classes

Post by WayneCa »

I have a few questions:
  1. Are the extension identifiers (*.ext) case insensitive, e.g. is *.ext the same as *.EXT?
  2. If they are not case sensitive, I have the following issue:

    I have a file extension, *.B09.
    Files with the .B09 extension show as type Custom but files with the extension .b09 show up as type B09.
    1. Why is this?
    2. What can I do to correct it so all files with .B09 or .b09 show up as type B09?
Further info: I have the files saved with Mac line endings and UTF-8 encoding. This is because these are source code files for a BASIC programming language on a retro computer. The line endings on that computer are $0D, the same as the Mac paragraph ending. The source files are plain text (no strange characters), but there are certain "hi-res" characters that are not dealt with correctly in TextPad. Example:
  • two characters I'm using are 0xAE and 0xBE. Looking at the file in a hex editor shows that the characters are saved in the file as 0xC2 0xAE and 0xC2 0xBE. I can strip the 0xC2 bytes from the file, but to ensure they do not return, I save the files as UTF-8. So, I have the document class definition set to save new files as UTF-8 with Mac line endings (see the byte-level sketch just below).
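A minimal sketch of that byte-level check, assuming Python 3 is at hand (the filename is just a placeholder), showing the same bytes a hex editor would:

    with open("program.b09", "rb") as f:              # placeholder filename
        data = f.read()

    print(" ".join(f"{b:02X}" for b in data))         # full hex dump of the file
    print(b"\xC2\xAE" in data)                        # True if U+00AE was written as UTF-8 (0xC2 0xAE)
    print(b"\xAE" in data.replace(b"\xC2\xAE", b""))  # True only if a lone 0xAE byte is present
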
FWIW, the files with the .B09 extension are still treated as B09 files in TextPad, even though the document properties show them as type Custom.

Update: I just looked at a new .b09 file I saved that showed up as type B09 when I saved it. Closing TextPad and reopening it shows the file as type Custom, so the B09 type is only recognized when saving a new file.

Re: filename extensions in the document classes

Post by AmigoJack »

U+00AE and U+00BE are correctly encoded as 0xC2 0xAE and 0xC2 0xBE in UTF-8. Why are you using Unicode at all? Use "ANSI" or "DOS" as encoding and everything is saved as expected, too - U+00AE should be interpreted/displayed as "®" and U+00BE as "¾".

Keep in mind that you need to tell TextPad the encoding when loading the file, as there is virtually no way to tell ANSI and UTF-8 apart. Preferably start TextPad and press CTRL+O - that way you can select an encoding ("ANSI") to load your BASIC source code file properly.
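For illustration only, a small Python sketch of that encoding difference (cp1252 standing in here for what TextPad labels "ANSI" on a Western-European Windows install):

    text = "\u00AE\u00BE"                # the two characters, "®" and "¾"
    print(text.encode("utf-8"))          # b'\xc2\xae\xc2\xbe' - two bytes per character
    print(text.encode("cp1252"))         # b'\xae\xbe' - one byte each, as on the retro system
    print(b"\xAE\xBE".decode("cp1252"))  # '®¾' - the raw bytes read back as "ANSI"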

Re: filename extensions in the document classes

Post by WayneCa »

I know which characters they are supposed to be. It is what happens to them I am concerned about. I need them to simply be a single character in the source file (0xAE or 0xBE) and not have extra bytes added to them just because of Windows. I don't think TextPad should be changing anything in a source file that the user didn't specifically change. I copy the source file from the retro computer system and TextPad adds the extra codes without even letting me know it did it or asking me if I wanted them changed. It already has a character (? in a box) for characters it doesn't understand, so why can't it just use that and leave the character as it was?

Also, I have found a work-around. I use the hex editor XVI32 and I can strip the extra bytes out with it. That helps, but it is an extra step I could live without.
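The same clean-up could also be scripted instead of done in a hex editor; a rough Python sketch, assuming the file really is valid UTF-8 and every character is below U+0100 (filenames are placeholders):

    with open("program.b09", "rb") as f:        # placeholder input file
        utf8_bytes = f.read()

    # Decoding as UTF-8 and re-encoding as Latin-1 turns 0xC2 0xAE back into a lone 0xAE.
    single_byte = utf8_bytes.decode("utf-8").encode("latin-1")

    with open("program_fixed.b09", "wb") as f:  # placeholder output file
        f.write(single_byte)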

Also, I started using UTF-8 encoding because I got tired of TextPad complaining that the characters weren't ANSI. Using UTF-8 encoding stopped that.

Re: filename extensions in the document classes

Post by AmigoJack »

This can't be consistent:
  • Either your files are already wrongly encoded in UTF-8 (because 0xAE alone is not valid in UTF-8) or it's not UTF-8 to begin with.
  • TextPad changes nothing unless you save the file. Not looking at the encoding you use when saving the file is your fault - why does your own document class "B09" use a default encoding of UTF-8 instead of ANSI or DOS?
  • WayneCa wrote: Thu Oct 19, 2023 4:52 pm: Looking at the file in a hex editor shows that the characters are saved in the file as 0xC2 0xAE and 0xC2 0xBE. I can strip the 0xC2 bytes from the file, but to ensure they do not return, I save the files as UTF-8.
    I have no idea how you can remotely achieve what you wrote: if you break UTF-8 encoding (removing one byte per character) then you're back at step 1 (original file/text encoding) and TextPad will, as per UTF-8, again save both characters with 2 bytes each.
Have you even tried opening the file in a running TextPad instance? Why not attach examples of your files to your post so we have a chance to reproduce your issue?
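A quick Python sketch of the first point above - a lone 0xAE is rejected as UTF-8 but is an ordinary byte under a single-byte "ANSI" code page (cp1252 used as a stand-in):

    raw = b"\xAE\xBE"
    try:
        raw.decode("utf-8")             # 0xAE cannot start a UTF-8 sequence
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)
    print(raw.decode("cp1252"))         # '®¾'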

I created this file, containing exactly 0xAE 0xBE. It opens just fine in TextPad 8.4.2, which even recognizes it as ANSI. I pressed F12 to save it under a different filename, and upon inspection it has no UTF-8 encoding either - both files are byte-for-byte identical. The line breaks are Windows, though that shouldn't matter:
  1. before.txt
  2. after.png
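The same round trip can be checked in script form; a rough sketch assuming cp1252 for "ANSI", with placeholder filenames (after.txt stands in, since the actual attachment is a screenshot):

    # Read the file as ANSI and write it back out unchanged.
    with open("before.txt", "r", encoding="cp1252", newline="") as f:
        text = f.read()
    with open("after.txt", "w", encoding="cp1252", newline="") as f:
        f.write(text)

    # Compare both files byte for byte.
    with open("before.txt", "rb") as a, open("after.txt", "rb") as b:
        print("identical:", a.read() == b.read())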

Re: filename extensions in the document classes

Post by WayneCa »

AmigoJack wrote: Mon Oct 23, 2023 1:06 am This can't be consistent:
  • Either your files are already wrongly encoded in UTF-8 (because 0xAE alone is not valid in UTF-8) or it's not UTF-8 to begin with.
  • TextPad changes nothing unless you save the file. Not looking at the encoding you use when saving the file is your fault - why does your own document class "B09" use a default encoding of UTF-8 instead of ANSI or DOS?
In a previous response I stated: "Also, I started using UTF-8 encoding because I got tired of TextPad complaining that the characters weren't ANSI. Using UTF-8 encoding stopped that." I'm pretty sure the complaint was due to the characters being a single byte and not a complete value, based on what you said about 0xC2 in a previous response to me: "U+00AE and U+00BE are correctly encoded as 0xC2 0xAE and 0xC2 0xBE in UTF-8." However, using UTF-8 got rid of the "not ANSI" message I was getting.
AmigoJack wrote: Mon Oct 23, 2023 1:06 am
  • WayneCa wrote: Thu Oct 19, 2023 4:52 pm: Looking at the file in a hex editor shows that the characters are saved in the file as 0xC2 0xAE and 0xC2 0xBE. I can strip the 0xC2 bytes from the file, but to ensure they do not return, I save the files as UTF-8.
    I have no idea how you can remotely achieve what you wrote: if you break UTF-8 encoding (removing one byte per character) then you're back at step 1 (original file/text encoding) and TextPad will, as per UTF-8, again save both characters with 2 bytes each.
Have you even tried opening the file in a running TextPad instance? Why not attach examples of your files to your post so we have a chance to reproduce your issue?

I created this file, containing exactly 0xAE 0xBE. It opens just fine in TextPad 8.4.2, which even recognizes it as ANSI. I pressed F12 to save it under a different filename, and upon inspection it has no UTF-8 encoding either - both files are byte-for-byte identical. The line breaks are Windows, though that shouldn't matter:
  1. before.txt
  2. after.png
Yes, using ANSI does leave the characters as they were originally. But the editor complains about them not being ANSI and I don't want to keep seeing that message. Is there a way to get that message to stop being displayed?

Also, my other question has not been addressed. It seems the only time the B09 document type is applied to a document is when the file hasn't been saved before. Once it has been saved, it always shows up as Custom, whether I use the .B09 or .b09 extension. Are the two treated as the same, and how can I get TextPad to see them as B09 files instead of Custom files?