I set my default document properties to save as PC and UTF-8, and it applies to all document types.
When I open a new document, the document properties confirm these settings. But when I copy and paste a page of text from Wikipedia (edit view), the document properties (Alt-Enter) change to ANSI-1252 and PC. This is obvious when I paste text with characters like "éó".
How can I correct this?
Problems with line terminations with all document types.
TextPad 8.16.0 64bit in English and TextPad 9.1.0 64bit in French, on two separate Windows installations
The properties are checked before and after saving a new file.
I was under the impression that page encoding and line termination were connected. After testing, I see that I was wrong. The line termination issue affected my work when I edited the same documents in Linux (on a dual-boot desktop). I resolved this by setting TextPad's line termination to Unix, since this does not affect my work in Windows.
However, I do have a problem with the page encoding. My default encoding is always UTF-8 with Unix line termination, but when I save an accented word like "Elisée", on reopening the same document the word changes to "ElisÃ©e" and the document encoding changes to 1252 (ANSI - Latin 1). I don't know what I am doing wrong.
You wrote:
    The properties are checked before and after saving a new file.
and
    when I save an accented word like "Elisée", on reopening the same document ...
There is some ambiguity here.
The Unicode value of the character é is 0x00E9. The UTF-8 encoding of this value is the byte sequence 0xC3, 0xA9. The Windows Latin-1 (code page 1252) decoding of these two bytes is the character sequence Ã©, which is what you are seeing.
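The byte-level explanation above can be reproduced in a few lines of Python. This is only an illustrative sketch: the variable names are made up for the example, and `cp1252` is Python's name for the Windows Latin-1 code page.

```python
# Encode the word as UTF-8: "é" (U+00E9) becomes the two bytes 0xC3, 0xA9.
text = "Elisée"
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'Elis\xc3\xa9e'

# Decode those same bytes with the wrong code page (Windows-1252).
# 0xC3 maps to "Ã" and 0xA9 maps to "©", giving the garbled word.
misread = utf8_bytes.decode("cp1252")
print(misread)  # ElisÃ©e

# The damage is reversible as long as the bytes were not altered:
# re-encode with cp1252 and decode as UTF-8 to recover the original.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired)  # Elisée
```

So the bytes on disk are still correct UTF-8; only the interpretation on reopening is wrong.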
In the absence of an explicit indication of the encoding of your text the editor must examine it and make a guess. If the text contains only a small proportion of non-ASCII characters the editor might conclude that the text is encoded in Windows Latin-1. That is what is happening here.
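To illustrate why a guess is needed at all, here is a rough sketch of one way a program could guess an encoding. It is not TextPad's algorithm (which, as described above, seems to weigh the proportion of non-ASCII characters); the function name `guess_encoding` and the fallback to `cp1252` are assumptions made for this example.

```python
def guess_encoding(data: bytes) -> str:
    """Hypothetical heuristic for illustration; not TextPad's actual logic."""
    # An explicit UTF-8 byte order mark removes all ambiguity.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    try:
        # Strict decoding fails on byte sequences that are not valid UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: treat anything else as Windows-1252.
        return "cp1252"


print(guess_encoding("Elisée".encode("utf-8")))   # utf-8
print(guess_encoding("Elisée".encode("cp1252")))  # cp1252
```

Any such heuristic can be wrong for some input, because the same bytes are often valid in more than one encoding.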
To solve this you could do one of these things:
    Increase the proportion of non-ASCII characters.
    But this is something you might have no control over.
    Include a byte order mark (BOM: Unicode 0xFEFF; UTF-8 0xEF, 0xBB, 0xBF) at the beginning of your document:
        File | Save As...
            Encoding: UTF-8        [X] UNICODE BOM
    But not all text-handling software is happy with a BOM at the beginning of the text.
    Save your session in a workspace and open the file by opening the workspace.
    This is probably the best solution.
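For anyone who wants to see what the BOM option puts into the file, the following Python sketch writes a small UTF-8 file with a BOM and Unix line endings, then inspects the raw bytes. The file name is hypothetical, and `utf-8-sig` is Python's codec name for "UTF-8 with BOM"; it is not a TextPad setting.

```python
import os
import tempfile

# Hypothetical file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "elisee.txt")

# "utf-8-sig" writes the BOM bytes EF BB BF before the text;
# newline="\n" keeps Unix line endings on every platform.
with open(path, "w", encoding="utf-8-sig", newline="\n") as f:
    f.write("Elisée\n")

# Inspect the raw bytes to confirm the BOM is there.
with open(path, "rb") as f:
    raw = f.read()
print(raw[:3].hex())  # efbbbf

# Reading with "utf-8-sig" strips the BOM again.
with open(path, encoding="utf-8-sig") as f:
    content = f.read()
print(content == "Elisée\n")  # True
```

A program that reads the file as plain UTF-8 without BOM handling will see an extra U+FEFF character at the start, which is why some tools are unhappy with it.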
Edit: Corrected typo.
Last edited by ben_josephs on Tue May 21, 2019 5:16 pm, edited 1 time in total.
ineuw wrote:
    ben_josephs, can't thank you enough for this explanation. It's very clear and concise.
Addendum: Your explanation about a single UTF-8 character in a document is validated. It changes the encoding to 1252 (ANSI - Latin 1).
In another TextPad document, which contained a number of non-ASCII characters, the encoding remained as set in the preferences: UTF-8.