UTF-8 document brings ANSI error

martib · Post by **martib** » Wed Feb 23, 2011 12:36 am

In TextPad 5.4.2 I select a document to open and choose the encoding UTF-8, but TextPad shows this warning: "WARNING: "TEST.txt" contains characters that do not exist in code page 1252 (ANSI - Latin I). They will be converted to the system default character, if you click OK."

I set UTF-8 as the default encoding in TextPad, and I am sure it really is a UTF-8 document.

What does this error mean, and how can I solve it?

SteveH · Post by **SteveH** » Wed Feb 23, 2011 12:36 pm

I think you are encountering this problem, described in the TextPad help file:

it is only possible to edit, without data loss, files containing characters from the implied code page. Other characters will be converted into a system default character (normally "?"), if you confirm that is what you want to do.

Even though I'm a huge fan of TextPad, I would be wary of using it to work with UTF files.

martib · Post by **martib** » Sat Feb 26, 2011 10:09 am

I haven't found anything adequate for UTF-8.
Most editors cannot handle huge files. I tried jEdit.

Do you have a recommendation?

SteveH · Post by **SteveH** » Sat Feb 26, 2011 10:23 am

Rather than advertise competing products I've sent you a DM

Ryck · Post by **Ryck** » Sun Feb 27, 2011 11:04 am

SteveH wrote:Rather than advertise competing products I've sent you a DM

I ran into the same problem late last year. I ended up writing something in Visual Basic to handle it. I would be curious to know what product you used if that path would have been easier - if you don't mind. Thanks.

ben_josephs · Post by **ben_josephs** » Sun Feb 27, 2011 12:53 pm

The problem is not that TextPad cannot handle the UTF-8 encoding, which is a way of transmitting and storing the text of a document. Rather, TextPad cannot handle the Unicode character set, which is an assignment of numerical values to more than 100 000 characters.

That is, TextPad can decode text that is encoded in UTF-8, and it can encode text into UTF-8. But it stores all the text of each document internally in a single 8-bit character set (a "code page" or "script"). Only 256 characters (including control characters) can be represented in an 8-bit character set.
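
The effect described above can be imitated in a few lines of Python. This is only a sketch of the general idea, not TextPad's actual implementation: the helper name `to_code_page` and the sample text are made up for illustration, and `errors="replace"` stands in for the substitution of a "system default character".

```python
def to_code_page(utf8_bytes: bytes, code_page: str = "cp1252") -> bytes:
    """Decode UTF-8, then force the text into a single 8-bit code page.

    Characters that the code page cannot represent become "?".
    """
    text = utf8_bytes.decode("utf-8")                 # decoding UTF-8 works fine
    return text.encode(code_page, errors="replace")   # the lossy step


# "café Ω": é (U+00E9) exists in code page 1252, Ω (U+03A9) does not.
sample = "caf\u00e9 \u03a9".encode("utf-8")
converted = to_code_page(sample)
print(converted.decode("cp1252"))  # the omega comes back as "?"
```

Decoding succeeds for the whole file; the loss only happens when the text has to fit into 256 possible values.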

If all the characters in your document are in a single code page and if that code page is available, select it. Otherwise TextPad will not display all the characters correctly.

The range of code pages available depends on the font you are using. You can select the code page at
Configure | Preferences | Document Classes | <Class> | Font | Script
or
View | Document Properties | Font | Script.

Here is a correspondence between some script names and code pages:

Western            1252
Greek              1253
Turkish            1254
Central European   1250
Cyrillic           1251
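
To check in advance whether a piece of text fits into one of the code pages listed above, something like the following Python sketch could be used. The function name `scripts_that_fit` is hypothetical, and the codec names (`cp1250` to `cp1254`) are Python's names for these Windows code pages, not anything taken from TextPad.

```python
from typing import List

# Script names as listed above, mapped to Python's codec names.
CODE_PAGES = {
    "Western": "cp1252",
    "Greek": "cp1253",
    "Turkish": "cp1254",
    "Central European": "cp1250",
    "Cyrillic": "cp1251",
}


def scripts_that_fit(text: str) -> List[str]:
    """Return the scripts whose code page can represent every character."""
    fits = []
    for script, codec in CODE_PAGES.items():
        try:
            text.encode(codec)  # raises if any character is missing
        except UnicodeEncodeError:
            continue
        fits.append(script)
    return fits


print(scripts_that_fit("Привет"))         # Russian text
print(scripts_that_fit("Привет Ωμέγα"))   # Russian plus Greek
```

An empty result means no single code page in the table covers the text, which is the situation where some characters end up as "?".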

Ryck · Post by **Ryck** » Sun Feb 27, 2011 5:32 pm

ben_josephs wrote:The problem is not that TextPad cannot handle the UTF-8 encoding, which is a way of transmitting and storing the text of a document. Rather, TextPad cannot handle the Unicode character set, which is an assignment of numerical values to more than 100 000 characters.

Thanks Ben. I find it insightful to know why something works or doesn't work so I can plan ahead or take a different course of action.

We have started to use a competitor's product for things we can't do in Textpad. Unfortunately this is starting to occur more frequently.

aceone · Post by **aceone** » Tue Mar 08, 2011 9:12 pm

I just ran into the exact same problem. I have a text file containing various Unicode characters, and TextPad displays them as ?'s.

I am still confused about what ben josephs said about handling unicode "encoding" vs unicode "character set". I thought the purpose of having unicode is to eliminate these different and incompatible character sets so that a single document containing various languages can be displayed properly. Could some of the experts here who have more knowledge regarding encoding and character set shed more light, please? Thanks.

SteveH · Post by **SteveH** » Tue Mar 08, 2011 10:07 pm

I think what ben_josephs is referring to is the difference between the character set (i.e. the supported symbols) and how they are represented (or encoded). The two terms are often used interchangeably.

Looking at the '€' symbol as an example:

This has code point 20AC in hex but can be encoded as AC 20 in UTF-16 (Little Endian) or 20 AC in UTF-16 (Big Endian) or E2 82 AC in UTF-8.
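
These byte sequences can be reproduced with a short Python sketch. The variable names are arbitrary, and the codec names are Python's own.

```python
euro = "\u20ac"  # the euro sign, code point U+20AC

# Encode the same character three ways and keep the raw bytes.
encodings = {
    codec: euro.encode(codec)
    for codec in ("utf-16-le", "utf-16-be", "utf-8")
}

for codec, raw in encodings.items():
    print(f"{codec:10} {raw.hex(' ').upper()}")
```

The code point stays the same in every case; only the bytes used to store it differ.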

Windows applications that don't use Unicode save text files using one of the Windows code pages, often called "ANSI" code pages. TextPad can display and save characters from only a single code page at a time. So, for instance, all the characters must fit into, say, the Windows-1252 character encoding scheme, which comprises around 224 'real' characters.

Once you have a text editor that supports Unicode text, you also need fonts that include all the required glyphs. After that you may need to support languages that are written from right to left as well. Not a simple thing.

Hope this helps.

ben_josephs · Post by **ben_josephs** » Tue Mar 08, 2011 10:30 pm

[ I'd already written this... ]

Unicode is a 21-bit character set: each character has a numerical value that can be represented in 21 bits.

There are several ways in which a stream of 21-bit values can be stored. UTF-8 is one of them, using from one to four bytes for each Unicode value. ASCII, a 7-bit character set, is a subset of Unicode, and in UTF-8 every ASCII value is encoded in a single byte. Other Unicode values are encoded in two, three or four bytes.
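
A small Python sketch makes the byte counts concrete. The four sample characters are chosen only for illustration.

```python
import sys

# One sample character for each UTF-8 length: A, é, €, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001f600"]

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, byte_lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")

# The largest Unicode value is 0x10FFFF, which needs 21 bits.
print(sys.maxunicode.bit_length())
```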

Google will help you find many descriptions of UTF-8.

aceone · Post by **aceone** » Wed Mar 09, 2011 8:54 pm

@SteveH Thank you for the explanation. From what you described, I think the key question is whether TextPad saves files in Unicode or not. Am I correct to interpret it this way: although TextPad's Save dialog allows one to save a text file in ASCII, ANSI, or UTF-8, it does not support UTF-8 natively and still needs to translate the characters into code pages?

It would be nice for TextPad to have full support, including different text orientations, but supporting UTF-8 would be a good starting point. Notepad, despite how primitive it is, does support UTF-8 natively; it doesn't handle different text orientations, and it shows characters missing from the font as blocks. I guess users can fully understand it if the font chosen to display the text does not contain the full Unicode set. On the other hand, I was confused when I picked the right font (Arial Unicode) but was still unable to display the text correctly.

@ben_josephs I think you had written this, but for some reason, I was having some difficulties interpreting it.

SteveH · Post by **SteveH** » Wed Mar 09, 2011 9:54 pm

aceone wrote:Am I correct to interpret it this way: although TextPad's Save dialog allows one to save a text file in ASCII, ANSI, or UTF-8, it does not support UTF-8 natively and still needs to translate the characters into code pages?

As Helios say themselves, I think it boils down to...

it is only possible to edit, without data loss, files containing characters from the implied code page

One of the good web resources on Unicode is alanwood.net, and they describe TextPad as follows:

It can be used to edit UTF-16 and UTF-8 files, but only with ... the characters for one codepage, not for the whole of Unicode. TextPad only supports a single font, so for multi-script Web pages a large font such as Arial Unicode MS is needed in order to show all of the characters.

Hope this helps.

aceone · Post by **aceone** » Wed Mar 09, 2011 10:15 pm

It certainly helps... and I just wanted to point out that even with the Arial Unicode MS font, TextPad still only displays one code page at a time and renders the rest as ?'s.

jannypan · Post by **jannypan** » Thu Mar 17, 2011 2:14 am

Haven't found something adequate for UTF-8.

aceone · Post by **aceone** » Thu Mar 17, 2011 12:59 pm

If you must, use Notepad for UTF-8.

I feel your pain. You'd think Notepad is at the bottom of the evolutionary chain of text editors, yet it supports Unicode fully; I still cannot comprehend why TextPad, which is supposed to be one of the most advanced text editors, does not support it.