Problem with 1 or 2 UTF-8 characters

General questions about using TextPad

Plasm
Posts: 3
Joined: Tue Feb 16, 2016 2:08 pm

Post by Plasm »

Hi, I encountered another problem in 8.1.2 (64Bit) related to UTF-8 encoding:
If I open a file with only one or two UTF-8 characters, the file is loaded as ANSI which leads to a broken character presentation. Even if the file open dialog is used and the charset is set to UTF-8 explicitly, the file is loaded as ANSI.
If the file has at least three UTF-8 characters, everything works fine.

Example with German umlauts:
ä => Ã¤
äö => Ã¤Ã¶
äöü => äöü
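
For reference, the broken presentation is what you get whenever UTF-8 bytes are read with a single-byte ANSI code page. A minimal Python sketch, assuming the ANSI code page is Windows-1252 (an assumption; the thread only says "ANSI"). The helper name `as_ansi` is made up for illustration and has nothing to do with TextPad's internals:

```python
def as_ansi(text: str, codepage: str = "cp1252") -> str:
    """Encode text as UTF-8, then misread those bytes as a single-byte code page."""
    return text.encode("utf-8").decode(codepage)


# Each umlaut is two bytes in UTF-8, so it shows up as two ANSI characters.
for sample in ("ä", "äö", "äöü"):
    print(f"{sample} => {as_ansi(sample)}")
```

This only illustrates what a wrong decoding looks like. It does not explain why the third case is displayed correctly in the editor.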

Best regards
Plasm
Plasm
Posts: 3
Joined: Tue Feb 16, 2016 2:08 pm

Post by Plasm »

To be clear: The problem occurs if there are only one/two UTF-8 characters amongst others.

Thus:
äbcdefghijklmnöpqrstuvwxyz => Ã¤bcdefghijklmnÃ¶pqrstuvwxyz (2x UTF-8)
äbcdefghijklmnöpqrstüvwxyz => äbcdefghijklmnöpqrstüvwxyz (3x UTF-8)

[No edit privilege, unfortunately]
bluesix
Posts: 8
Joined: Wed Aug 26, 2015 5:41 am

Post by bluesix »

I can report the same issue.
When I re-open files saved as UTF-8, they are opened as ANSI (Latin 1) and therefore displayed with corrupted characters.
christiandittmann41
Posts: 5
Joined: Fri Jul 15, 2016 10:10 pm
Location: Düsseldorf/Germany, La Nucia, Spain

one, to three utf chars

Post by christiandittmann41 »

Hello!
I've tested your problem with Win10 and TP32. All is ok.
The error occurs only in the 64bit version.
So, the workaround is to use the 32bit version of TP.
Why do you think that you really need the 64bit version? This is ridiculous, no one edits such large files and in a dialog program speed is secondary...

So long
Christian, the kraut, from good old germany
AmigoJack
Posts: 515
Joined: Sun Oct 30, 2016 4:28 pm
Location: グリーン ヒル ゾーン

Re: one, to three utf chars

Post by AmigoJack »

Thanks for this hint, although it doesn't make that much sense why a different platform compilation should behave differently in its logic.


christiandittmann41 wrote:Why do you think that you really need the 64bit version?
Because the system is 64bit and every process not being 64bit needs to be adapted, hence running effectively slower.
christiandittmann41 wrote:no one edits such large files
I do (i.e. 2.6 GiB files) and I am someone.
christiandittmann41 wrote:in a dialog program speed is secondary
By that you mean that speed in your internet browser, your photo editor, your file manager and probably non-fullscreen games is not important to you either? I have my doubts.
jmparatte
Posts: 4
Joined: Fri Nov 25, 2011 11:07 am
Location: Switzerland

Re: Problem with 1 or 2 UTF-8 characters

Post by jmparatte »

Plasm wrote:Hi, I encountered another problem in 8.1.2 (64Bit) related to UTF-8 encoding:
If I open a file with only one or two UTF-8 characters, the file is loaded as ANSI which leads to a broken character presentation. Even if the file open dialog is used and the charset is set to UTF-8 explicitly, the file is loaded as ANSI.
If the file has at least three UTF-8 characters, everything works fine.

Example with German umlauts:
ä => Ã¤
äö => Ã¤Ã¶
äöü => äöü

Best regards
Plasm
My workaround is to insert a comment with three non-ASCII characters at the beginning of the file, for example in a PHP file:

<?php //éèà
...
?>
The three non-ASCII characters "éèà", placed very near the beginning of the file, are analyzed and decoded correctly, so the file is detected as UTF-8.
If the same sequence is placed too far from the beginning, the encoding may be detected incorrectly.
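
How TextPad decides on the encoding is not stated anywhere in this thread. For comparison only, here is a minimal Python sketch of a common BOM-less heuristic: treat the data as UTF-8 if it decodes strictly as UTF-8 and contains at least one non-ASCII byte. The function name `looks_like_utf8` is hypothetical, and this is a heuristic rather than a proof, since the same bytes are usually also valid in an ANSI code page:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: valid UTF-8 containing at least one multi-byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Some byte sequence is not well-formed UTF-8 (typical for ANSI text).
        return False
    # Pure ASCII is valid UTF-8 too, but carries no evidence either way.
    return any(byte >= 0x80 for byte in data)


print(looks_like_utf8("<?php //ä".encode("utf-8")))   # one umlaut, stored as UTF-8
print(looks_like_utf8("<?php //ä".encode("cp1252")))  # same text, stored as ANSI
```

With a check like this, the number of non-ASCII characters and their distance from the start of the file would not matter, as long as the whole file is examined.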
Plasm
Posts: 3
Joined: Tue Feb 16, 2016 2:08 pm

Post by Plasm »

Problem still persists in 8.2.0 (64 Bit).

Test case:
- Create a new file
- Write: "äeiöu"
- Save the file as UTF-8 without BOM
- Close the file (or TextPad itself)
- Open the file by double-clicking on it, from the open dialog or via dragging it into TextPad (doesn't matter)
- Result: TextPad displays "Ã¤eiÃ¶u"

The file is saved correctly (tested with other editors). The problem occurs when opening the file.
If there are more than 2 UTF-8 characters, everything is fine. For example: "äeiöü" results in "äeiöü".

BTW: I saved the file as .txt. The Text document class has UTF-8 charset and no BOM as default settings, if that matters.
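
In case someone wants to reproduce this without relying on an editor's save dialog, the two test files can be generated byte-exactly with a short Python sketch. The file names and the function name `write_test_files` are arbitrary choices for illustration:

```python
import tempfile
from pathlib import Path


def write_test_files(directory: Path) -> dict:
    """Write the two samples as UTF-8 without BOM and return the raw bytes written."""
    samples = {
        "two_umlauts.txt": "äeiöu",    # reported as broken on open
        "three_umlauts.txt": "äeiöü",  # reported as displayed correctly
    }
    written = {}
    for name, text in samples.items():
        path = directory / name
        path.write_bytes(text.encode("utf-8"))  # "utf-8" never emits a BOM
        written[name] = path.read_bytes()
    return written


if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp(prefix="textpad_utf8_"))
    for name, raw in write_test_files(out_dir).items():
        # Hex dump so the absence of a BOM (ef bb bf) can be checked by eye.
        print(out_dir / name, raw.hex(" "))
```

Opening both generated files in the editor should then show whether the behaviour depends only on the number of non-ASCII characters.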

Best regards
Plasm
jmparatte
Posts: 4
Joined: Fri Nov 25, 2011 11:07 am
Location: Switzerland

Post by jmparatte »

Plasm wrote:...example: "äeiöü" results in "äeiöü"...
Decoding on open also fails with three non-ASCII characters when those three characters are not consecutive.
ben_josephs
Posts: 2461
Joined: Sun Mar 02, 2003 9:22 pm

Post by ben_josephs »

In https://forums.textpad.com/viewtopic.php?t=13253 I suggested:

    Save your session in a workspace and open the file by opening the workspace.

Is that a suitable solution?

(I use workspaces for all my non-transient editing work.)