Problems with line terminations with all document types.

Posted: Sat May 18, 2019 3:09 am
by ineuw
I set my default document properties to save as PC and UTF-8, and it applies to all document types.

When I open a new document, the document properties indicate that this is so. But when I copy and paste a page of text from a Wikipedia page (edit view), the document properties (Alt-Enter) change to ANSI-1252 and PC. This is obvious when I paste text with characters like "éó".

How can I correct this?

Posted: Sat May 18, 2019 3:10 pm
by ben_josephs
The subject of your post refers to problems with line terminations, but the body of the post describes problems with the character encoding.

Either way, I can't reproduce this problem with TextPad 8.1.1 on Windows 10.

Are you checking the properties before or after you save the new file?

Posted: Sat May 18, 2019 7:55 pm
by ineuw
The properties are checked before and after saving a new file.

I was under the impression that page encoding and line termination are connected. After testing, I see that I was wrong. The line termination issue affected my work when working on the same documents in Linux (on a dual-boot desktop). I resolved this by setting TextPad's line termination to Unix, since this does not affect my work in Windows.

However, I do have a problem with the page encoding. My default encoding is always UTF-8 with Unix line termination, but when I save an accented word like "Elisée", on reopening the same document the word changes to "ElisÃ©e" and the document encoding changes to 1252 (ANSI - Latin 1). I don't know what I am doing wrong.

Posted: Tue May 21, 2019 9:18 am
by ben_josephs
You wrote:
    The properties are checked before and after saving a new file.
and
    when I save an accented word like "Elisée", on reopening the same document ...

There is some ambiguity here.

The Unicode value of the character é is 0x00E9. The UTF-8 encoding of this value is the byte sequence 0xC3, 0xA9. The Windows Latin-1 decoding of these two bytes is the character sequence Ã©, which is what you are seeing.
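
The byte-level misreading can be reproduced outside the editor. This is only a minimal sketch using Python's standard codecs, where "cp1252" is assumed to correspond to what TextPad labels 1252 (ANSI - Latin 1); it says nothing about how TextPad itself reads files.

```python
# Illustrative sketch: reproduce the mojibake with Python's built-in codecs.
word = "Elisée"

# Encode as UTF-8: the single character é becomes the two bytes 0xC3, 0xA9.
utf8_bytes = word.encode("utf-8")
print(utf8_bytes.hex(" "))   # 45 6c 69 73 c3 a9 65

# Decode the same bytes as Windows-1252: each byte becomes its own character.
misread = utf8_bytes.decode("cp1252")
print(misread)               # ElisÃ©e

# The damage is reversible as long as the misread text has not been altered.
roundtrip = misread.encode("cp1252").decode("utf-8")
print(roundtrip)             # Elisée
```

The bytes on disk are the same in both cases; only the interpretation differs.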

In the absence of an explicit indication of the encoding of your text the editor must examine it and make a guess. If the text contains only a small proportion of non-ASCII characters the editor might conclude that the text is encoded in Windows Latin-1. That is what is happening here.

To solve this you could do one of these things:

    Increase the proportion of non-ASCII characters.
    But this is something you might have no control over.

    Include a byte order mark (BOM: Unicode 0xFEFF; UTF-8 0xEF, 0xBB, 0xBF) at the beginning of your document:
        File | Save As...
            Encoding: UTF-8        [X] UNICODE BOM
    But not all text-handling software is happy with a BOM at the beginning of the text.

    Save your session in a workspace and open the file by opening the workspace.
    This is probably the best solution.
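
For the BOM option, the effect on the saved bytes can be sketched as follows. This assumes Python's "utf-8-sig" codec as a stand-in for saving with the UNICODE BOM box ticked; the helper has_utf8_bom is a hypothetical name introduced only for this illustration.

```python
import codecs

text = "Elisée"

# "utf-8-sig" prepends the three BOM bytes 0xEF, 0xBB, 0xBF when encoding.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")


def has_utf8_bom(data: bytes) -> bool:
    """Hypothetical helper: True if the data starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


print(has_utf8_bom(with_bom))      # True
print(has_utf8_bom(without_bom))   # False

# Decoding with "utf-8-sig" strips the BOM again; plain "utf-8" keeps it
# as a leading U+FEFF character, which some software does not expect.
decoded_clean = with_bom.decode("utf-8-sig")
decoded_raw = with_bom.decode("utf-8")
```

With the BOM present, a reader no longer has to guess the encoding from the proportion of non-ASCII characters.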

Edit: Corrected typo.

Posted: Tue May 21, 2019 5:10 pm
by ineuw
ben_josephs, can't thank you enough for this explanation. It's very clear and concise.

Posted: Wed May 22, 2019 10:17 pm
by ineuw
ineuw wrote:
    ben_josephs, can't thank you enough for this explanation. It's very clear and concise.
Addendum: I have confirmed your explanation. When a document contains only a single non-ASCII character, the encoding changes to 1252 (ANSI - Latin 1).

In another TextPad document, which contained a number of non-ASCII characters, the encoding remained UTF-8, as set in the preferences.