Problem with UTF-8 encoded files

General questions about using TextPad

Thomas Fenner

Problem with UTF-8 encoded files

Post by Thomas Fenner »

When I open a UTF-8 encoded file which includes umlauts, for example the German 'Ä' (Unicode 00C4), it only displays these umlauts correctly when there are more than two of them in the file. When there are only one or two umlauts, the characters are not displayed correctly. When you check the hex code of the file's content, everything looks fine. Another UTF-8-capable program, UniRed (http://www.esperanto.mv.ru/UniRed/UTF8/index.html), also shows that the file is OK. I guess this problem applies to all non-ANSI characters.
Christoph Nahr

Re: Problem with UTF-8 encoded files

Post by Christoph Nahr »

Does your file have a proper Unicode signature? If it doesn't, TextPad can only guess at the encoding, since UTF-8 is not its default file format (unlike UniRed?). That might explain why it doesn't interpret the file as UTF-8 when fewer than three umlauts are present. Try saving or generating the file with a signature.
Thomas Fenner

Re: Problem with UTF-8 encoded files

Post by Thomas Fenner »

I think so.

Just try this:

1. Open TextPad and create a new file
2. Type in or copy & paste two umlauts (or any other non-ANSI characters)
3. Save the file as UTF-8 and close TextPad
4. Open the file you saved in step 3
5. Wonder what has happened to your characters

or

try it with this simple Java program:

import java.io.*;

class Test {
    public static void main(String[] args) {
        try {
            OutputStreamWriter osw = new OutputStreamWriter(
                    new FileOutputStream("test.txt"), "UTF8");

            // This does not work (two chars ÄÄ):
            String test = String.valueOf('\u00C4') + String.valueOf('\u00C4');

            // This works fine (three chars ÄÄÄ):
            // String test = String.valueOf('\u00C4') + String.valueOf('\u00C4') + String.valueOf('\u00C4');

            osw.write(test);
            osw.flush();
            osw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

:) Thomas
Christoph Nahr

Re: Problem with UTF-8 encoded files

Post by Christoph Nahr »

No, the problem is indeed that the signature is missing.

(The UTF-8 signature consists of the first three bytes of a text file: EF BB BF in hexadecimal. Please consult the Unicode standard for more info.)

I can replicate your problem just fine. Unfortunately, Textpad does not save a Unicode signature for UTF-8 files, and it has to guess at the encoding when opening a UTF-8 file without signature. Interestingly, Textpad does recognize an existing UTF-8 signature -- but it can't save it back!

To test this, you must create a UTF-8 file with signature. As I said, Textpad can't do this; I don't know if the Java stream classes can do it (the Microsoft .NET stream classes can). I just tested it with Visual Studio .NET which includes a native Unicode editor capable of saving UTF-8 with signature. A test file with just two umlauts was displayed correctly by Textpad after saving with signature in VS.NET. Textpad again dropped the signature when saving the unchanged data to a different file, though.
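As for whether the Java stream classes can do it: OutputStreamWriter with "UTF8" does not emit a signature by itself, but you can write the three bytes by hand before the encoded text. A minimal sketch (the helper name writeUtf8WithBom is my own, for illustration):

```java
import java.io.*;

public class Utf8WithBom {
    // Writes text to a file as UTF-8, preceded by the EF BB BF signature.
    static void writeUtf8WithBom(File file, String text) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(file)) {
            fos.write(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF }); // the signature
            fos.write(text.getBytes("UTF-8"));
        }
    }

    public static void main(String[] args) throws IOException {
        // Two umlauts plus a signature: an editor should now detect UTF-8.
        writeUtf8WithBom(new File("test.txt"), "\u00C4\u00C4");
    }
}
```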

Conclusion: you cannot currently use Textpad to edit UTF-8 files that must have a signature. In particular, this means that you cannot currently use Textpad to edit files that contain so few extended characters that Textpad cannot guess at the UTF-8 encoding. This is a real problem and I hope it will be fixed in the next version (which should really be a native Unicode editor, in my opinion).
Andreas

Re: Problem with UTF-8 encoded files

Post by Andreas »

It is possible with TextPad to create the UTF-8 signature.

Start with an empty file, insert one character (it does not matter which), then do a replace of
.
with
\xEF\xBB\xBF
(make sure "Regular expression" is selected).

Now save the file as ANSI (if you save it as UTF-8, the three characters get translated...)!

Close the file.
Reopen the file.
Go to view - Document properties.
It says UTF-8 (signature)

edit some stuff.
Save the file.
Close the file.
Reopen the file.
Go to view - Document properties.
It still says UTF-8 (signature)

Once the signature is there, it persists.
As the signature is not really part of the file content, it just does not get displayed...

The only problem I can detect here is that no UTF-8 signature is written if you save a new file as UTF-8.
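As a quick sanity check of this workaround, a file can be inspected programmatically for the signature. A small sketch in Java (hasUtf8Bom is a made-up helper name):

```java
import java.io.*;

public class BomCheck {
    // Returns true if the file begins with the UTF-8 signature EF BB BF.
    static boolean hasUtf8Bom(File file) throws IOException {
        try (FileInputStream fis = new FileInputStream(file)) {
            byte[] head = new byte[3];
            return fis.read(head) == 3
                    && head[0] == (byte) 0xEF
                    && head[1] == (byte) 0xBB
                    && head[2] == (byte) 0xBF;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(hasUtf8Bom(new File("test.txt")));
    }
}
```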
Christoph Nahr

Re: Problem with UTF-8 encoded files

Post by Christoph Nahr »

Okay, manually creating a signature with character replacement is a possibility, though probably a bit awkward for daily use! Now if Textpad had macros one could write a macro for this task...

Opening with signature works for me, too. But I can't save back with a signature -- if I try to save back a UTF-8/signature file Textpad actually gives me an error: "Failed to save document"! And if I try to manually save as UTF-8 Textpad drops the signature, as usual.

Come to think of it, I do seem to recall that saving back a UTF-8/sig file used to work. But not now, at least not for me. Was this bug introduced in the latest version of Textpad? I'm running Textpad 4.5.0 (registered). Or is it perhaps an NT-specific problem? I'm using Windows XP Pro.
Thomas Fenner

Re: Problem with UTF-8 encoded files

Post by Thomas Fenner »

I just downloaded AbiWord 1.0.3 (http://www.abisource.com/download/), another little editor program.
After saving the content "ÄÄÄ" into a UTF-8 encoded file there, the file can be reopened as a still UTF-8 encoded file, and the characters are displayed correctly too.
Opening this file with TextPad works as well.

Doing the same procedure with the content "ÄÄ", TextPad only shows ANSI.

The hex dumps (via HEXpert 3.0.21, http://www.hexpertsystems.com/hexpert.html) of these files are:

1. "ÄÄÄ" : c384c384c384
2. "ÄÄ" : c384c384

Strange, there isn't a signature ...

Anyway, AbiWord seems the right program for me editing UTF-8 encoded files.
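Those dumps match what Java produces for the UTF-8 encoding of 'Ä' (C3 84 per character, with no signature). A small sketch to reproduce them (utf8Hex is a name I made up):

```java
import java.io.*;

public class HexDumpDemo {
    // Hex-encodes the UTF-8 bytes of a string, like the HEXpert dumps above.
    static String utf8Hex(String s) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes("UTF-8")) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(utf8Hex("\u00C4\u00C4\u00C4")); // c384c384c384
        System.out.println(utf8Hex("\u00C4\u00C4"));       // c384c384
    }
}
```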
Peter

Re: Problem with UTF-8 encoded files

Post by Peter »

Hi guys,

The problem gets worse with Asian UTF-8 characters...

I can easily create a UTF-8 file with Chinese characters under Windows XP. A hex dump of this file shows the correct encoding of the file and characters. If I try to open this file I get the error:

'WARNING: "tmp.xsl" contains characters that do not exist in the code page
1252 (ANSI Latin I). They will be converted to the system default character
if you click OK.'

No combination of file-open parameters will prevent this. And if you click OK, all the characters are converted to '???' strings. Saving it in UTF-8 format doesn't help either.

So, I think it is a bit of marketing hype for Helios to claim they support UTF-8. I mean, come on, if NOTEPAD can do this can it really be so hard??? {;-)
Simon Wilson

Re: Problem with UTF-8 encoded files

Post by Simon Wilson »

I have had the exact same problem with Chinese characters, and if I switch the default font code page it complains about other characters in the file instead. As you say, it really can't be *that* hard if Notepad can handle it. I have just downloaded UltraEdit, and it opens the file as expected: no drama, no complaints, and, even more remarkably, it manages this while using standard fonts.
Guest

Re: Problem with UTF-8 encoded files

Post by Guest »

I also found an alternate editor.

It works great but I wish I could get this with all the TextPad features I like!
gbuhlman
Posts: 1
Joined: Fri Nov 09, 2007 11:29 pm

Re: Problem with UTF-8 encoded files

Post by gbuhlman »

Does TextPad ever respond to these threads? I have wanted to recommend using TextPad at my company for many years but this is still an issue and I don't know why they won't fix it.

At least now in 5.0 you can Save As UTF-8 and it will write the signature. But if you open a file with a UTF-8 signature and then save it, TextPad does not write it back.

This one bug prevents us from using TextPad.
helios
Posts: 710
Joined: Sun Mar 02, 2003 5:52 pm
Location: Helios Software Solutions

Post by helios »

Please check that the option "Write Unicode and UTF-8 BOM" is ticked for the particular document class on the Preferences page.
Helios Software Solutions
jerrygiicojp
Posts: 1
Joined: Thu Mar 13, 2008 1:58 pm

Reading UTF-8

Post by jerrygiicojp »

I can't even read a UTF-8 (or Unicode) file successfully, because of the forced conversion to Latin-1. I often have to deal with a mixture of Chinese, Japanese, and Latin characters, for example from an Access table export. As has been said elsewhere, Notepad reads this like a champ. Regardless of the settings I've tried, I cannot get TextPad 5.2 to read these files without forcing a conversion to Latin-1 (codepage 1252).

So far I've been using some rather tortured work-arounds, such as using Excel as a text editor, but that's pretty lame.
Nicholas Jordan
Posts: 124
Joined: Mon Dec 20, 2004 12:33 am
Location: Central Texas ISO Latin-1

don't know if the Java stream classes can do it

Post by Nicholas Jordan »

Christoph Nahr wrote:(...snip...)

To test this, you must create a UTF-8 file with signature. As I said, Textpad can't do this; I don't know if the Java stream classes can do it (the Microsoft .NET stream classes can). I just tested it with Visual Studio .NET which includes a native Unicode editor capable of saving UTF-8 with signature. A test file with just two umlauts was displayed correctly by Textpad after saving with signature in VS.NET. Textpad again dropped the signature when saving the unchanged data to a different file, though.

(...snip...)
There is hope: Supported Encodings

(The following was noted while working on this:)

P:\src\java\io\DataOutputStream.java - line 305


public final void writeUTF(String str) throws IOException 
According to the comments, which should be available in the javadocs, the first two bytes written give the number of bytes to follow.

That obviously is not a BOM.
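To see what writeUTF actually emits, the bytes can be captured in memory; note the two-byte big-endian length prefix in place of any signature (writeUtfBytes is a helper name of my own):

```java
import java.io.*;

public class WriteUtfDemo {
    // Captures the bytes DataOutputStream.writeUTF produces for a string.
    static byte[] writeUtfBytes(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Two umlauts: 00 04 (length prefix), then C3 84 C3 84 -- no EF BB BF.
        for (byte b : writeUtfBytes("\u00C4\u00C4")) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```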