Problems with line terminations with all document types.

ineuw · Post by **ineuw** » Sat May 18, 2019 3:09 am

I set my default document properties to save as PC and UTF-8, and it applies to all document types.

When I open a new document, the document properties indicate this, but when I copy and paste a page of text from Wikipedia (edit view), the document properties (Alt-Enter) change to ANSI-1252 and PC. This is obvious when I paste text with characters like "éó".

How can I correct this?

ben_josephs · Post by **ben_josephs** » Sat May 18, 2019 3:10 pm

The subject of your post refers to problems with line terminations, but the body of the post describes problems with the character encoding.

Either way, I can't reproduce this problem with TextPad 8.1.1 on Windows 10.

Are you checking the properties before or after you save the new file?

ineuw · Post by **ineuw** » Sat May 18, 2019 7:55 pm

The properties are checked before and after saving a new file.

I was under the impression that page encoding and line termination are related. After testing, I see that I was wrong. The line termination issue affected my work when I edited the same documents in Linux (on a dual-boot desktop). I resolved this by setting TextPad's line termination to Unix, since this does not affect my work in Windows.

However, I do have a problem with the page encoding. My default encoding is always UTF-8 with Unix line termination, but when I save an accented word like "Elisée", on reopening the same document the word changes to "ElisÃ©e" and the document encoding changes to 1252 (ANSI - Latin 1). I don't know what I am doing wrong.

ben_josephs · Post by **ben_josephs** » Tue May 21, 2019 9:18 am

You wrote:
    The properties are checked before and after saving a new file.
and
    when I save an accented word like "Elisée", on reopening the same document ...

There is some ambiguity here.

The Unicode value of the character é is 0x00E9. The UTF-8 encoding of this value is the byte sequence 0xC3, 0xA9. The Windows Latin-1 decoding of these values is the character sequence Ã©, which is what you are seeing.
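
The round trip just described can be reproduced outside the editor. This is a minimal sketch in Python (chosen only for illustration, it has nothing to do with TextPad itself), using `cp1252` as Python's name for the Windows Latin-1 code page:

```python
# Round trip: U+00E9 encoded as UTF-8, then misread as Windows-1252.
original = "\u00e9"                       # é

utf8_bytes = original.encode("utf-8")     # b'\xc3\xa9'
misread = utf8_bytes.decode("cp1252")     # 'Ã©' (U+00C3, U+00A9)

# Undoing the wrong decoding recovers the original character.
repaired = misread.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex())                   # c3a9
print([hex(ord(c)) for c in misread])     # ['0xc3', '0xa9']
print(repaired == original)               # True
```

The bytes on disk are the same in both cases; only the interpretation differs.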

In the absence of an explicit indication of the encoding of your text the editor must examine it and make a guess. If the text contains only a small proportion of non-ASCII characters the editor might conclude that the text is encoded in Windows Latin-1. That is what is happening here.
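
To illustrate the kind of guess involved, here is a purely hypothetical heuristic in Python. It is not TextPad's actual algorithm, which is not documented in this thread; the function name `guess_encoding` and the 1% threshold are assumptions made up for the example.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def guess_encoding(data: bytes, min_non_ascii_ratio: float = 0.01) -> str:
    """Hypothetical encoding guesser; NOT TextPad's real detection logic."""
    # An explicit BOM removes the need to guess.
    if data.startswith(UTF8_BOM):
        return "utf-8-sig"

    # Bytes that are not valid UTF-8 cannot be UTF-8 text.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "cp1252"

    non_ascii = sum(1 for b in data if b >= 0x80)
    if non_ascii == 0:
        return "ascii"  # pure ASCII reads the same either way

    # Assumed rule: too few non-ASCII bytes is treated as weak evidence.
    if non_ascii / len(data) < min_non_ascii_ratio:
        return "cp1252"
    return "utf-8"


sparse = ("a" * 1000 + "\u00e9").encode("utf-8")
dense = "\u00e9\u00f3\u00e9\u00f3".encode("utf-8")

print(guess_encoding(sparse))             # cp1252
print(guess_encoding(dense))              # utf-8
print(guess_encoding(UTF8_BOM + sparse))  # utf-8-sig
```

Under these assumed rules, the file with a single accented character is misclassified, while the same file with a BOM is not.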

To solve this you could do one of these things:

    Increase the proportion of non-ASCII characters.
    But this is something you might have no control over.

    Include a byte order mark (BOM: Unicode 0xFEFF; UTF-8 0xEF, 0xBB, 0xBF) at the beginning of your document:
        File | Save As...
            Encoding: UTF-8        [X] UNICODE BOM
    But not all text-handling software is happy with a BOM at the beginning of the text.

    Save your session in a workspace and open the file by opening the workspace.
    This is probably the best solution.
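
For the BOM option, the effect on the saved bytes can be checked with a short Python sketch. It uses Python's standard `utf-8-sig` codec as a stand-in for an editor's "save with BOM" setting, which is an assumption for illustration only; the sample word is the one from this thread.

```python
import codecs

word = "Elis\u00e9e"  # Elisée

with_bom = word.encode("utf-8-sig")   # BOM followed by the UTF-8 bytes
without_bom = word.encode("utf-8")    # UTF-8 bytes only

print(with_bom.startswith(codecs.BOM_UTF8))       # True
print(with_bom[:3].hex())                         # efbbbf
print(with_bom[3:] == without_bom)                # True

# A BOM-aware reader strips the mark; a plain UTF-8 reader keeps U+FEFF.
print(with_bom.decode("utf-8-sig") == word)       # True
print(with_bom.decode("utf-8")[0] == "\ufeff")    # True
```

The last line shows why some software is unhappy with a BOM: a reader that does not expect it sees an extra character at the start of the text.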

Edit: Corrected typo.

ineuw · Post by **ineuw** » Tue May 21, 2019 5:10 pm

ben_josephs, can't thank you enough for this explanation. It's very clear and concise.

ineuw · Post by **ineuw** » Wed May 22, 2019 10:17 pm

ineuw wrote: ben_josephs, can't thank you enough for this explanation. It's very clear and concise.

Addendum: Your explanation about a single UTF-8 character in a document is validated. It changes the encoding to 1252 (ANSI - Latin 1).

In another TextPad document, which contained a number of UTF-8 characters, the encoding remained as set in the preferences: UTF-8.