UTF-8 document brings ANSI error
In TextPad 5.4.2 I select a document to open and choose Encoding UTF-8, but TextPad shows the error message "WARNING: "TEST.txt" contains characters that do not exist in code page 1252 (ANSI - Latin I). They will be converted to the system default character, if you click OK."
I set UTF-8 as the default encoding in TextPad, and I am sure it really is a UTF-8 document.
What does this error mean, and how can I solve it?
I think you are encountering this problem, described in the TextPad help file:
"it is only possible to edit, without data loss, files containing characters from the implied code page. Other characters will be converted into a system default character (normally "?"), if you confirm that is what you want to do."
Even though I'm a huge fan of TextPad, I would be wary of using it to work with UTF files.
Running TextPad 5.4 on Windows XP SP3 and on OS X 10.7 under VMWare or Crossover.
-
- Posts: 2461
- Joined: Sun Mar 02, 2003 9:22 pm
The problem is not that TextPad cannot handle the UTF-8 encoding, which is a way of transmitting and storing the text of a document. Rather, TextPad cannot handle the Unicode character set, which is an assignment of numerical values to more than 100 000 characters.
That is, TextPad can decode text that is encoded in UTF-8, and it can encode text into UTF-8. But it stores all the text of each document internally in a single 8-bit character set (a "code page" or "script"). Only 256 characters (including control characters) can be represented in an 8-bit character set.
If all the characters in your document are in a single code page and if that code page is available, select it. Otherwise TextPad will not display all the characters correctly.
The range of code pages available depends on the font you are using. You can select the code page at
Configure | Preferences | Document Classes | <Class> | Font | Script
or
View | Document Properties | Font | Script.
Here is a correspondence between some script names and code pages:
Western 1252
Greek 1253
Turkish 1254
Central European 1250
Cyrillic 1251
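A minimal Python sketch of the situation described above, for anyone who wants to check a file before opening it. It is not TextPad's own code: the names `SCRIPT_CODECS`, `fits`, `scripts_that_fit` and `lossy_convert` are made up for illustration, and it assumes Python's `cp125x` codecs are a fair stand-in for the Windows code pages in the table.

```python
# Hypothetical helpers; they only imitate the "one code page per document" limit.
SCRIPT_CODECS = {
    "Western": "cp1252",
    "Greek": "cp1253",
    "Turkish": "cp1254",
    "Central European": "cp1250",
    "Cyrillic": "cp1251",
}


def fits(text, codec):
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def scripts_that_fit(text):
    """List the script names whose code page can hold the whole text."""
    return [name for name, codec in SCRIPT_CODECS.items() if fits(text, codec)]


def lossy_convert(text, codec="cp1252"):
    """Replace characters missing from the code page with '?'."""
    return text.encode(codec, errors="replace").decode(codec)


if __name__ == "__main__":
    # Bytes as they would sit in a UTF-8 file; decoding them is not the problem.
    raw = "a\u03a9b".encode("utf-8")
    text = raw.decode("utf-8")
    print(scripts_that_fit(text))   # which single code page could hold it
    print(lossy_convert(text))      # what is left after forcing it into 1252
```

If `scripts_that_fit` returns an empty list, no single code page from the table can hold the document, which is the case where characters would be lost.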
ben_josephs wrote: The problem is not that TextPad cannot handle the UTF-8 encoding, which is a way of transmitting and storing the text of a document. Rather, TextPad cannot handle the Unicode character set, which is an assignment of numerical values to more than 100 000 characters.
Thanks Ben. I find it insightful to know why something works or doesn't work so I can plan ahead or take a different course of action.
We have started to use a competitor's product for things we can't do in Textpad. Unfortunately this is starting to occur more frequently.
I just ran into the exact same problem. I have a text file containing various Unicode characters, and TextPad displays them as ?'s.
I am still confused about what ben_josephs said about handling Unicode "encoding" vs Unicode "character set". I thought the purpose of having Unicode is to eliminate these different and incompatible character sets so that a single document containing various languages can be displayed properly. Could some of the experts here who have more knowledge regarding encodings and character sets shed more light, please? Thanks.
I think what ben_josephs is referring to is the difference between the character set (i.e. the supported symbols) and how those symbols are represented (or encoded). The two terms are often used interchangeably.
Looking at the '€' symbol as an example:
This has code point 20AC in hex, but it can be encoded as AC 20 in UTF-16 (Little Endian), as 20 AC in UTF-16 (Big Endian), or as E2 82 AC in UTF-8.
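The byte sequences above can be reproduced with a short Python sketch. The function name `encodings` is just an illustrative choice; the codec names are the standard Python ones.

```python
def encodings(ch):
    """Return the hex bytes of one character in three Unicode encodings."""
    result = {}
    for codec in ("utf-16-le", "utf-16-be", "utf-8"):
        # bytes.hex(" ") separates the bytes with spaces (Python 3.8+).
        result[codec] = ch.encode(codec).hex(" ").upper()
    return result


if __name__ == "__main__":
    euro = "\u20ac"
    print(f"code point: U+{ord(euro):04X}")
    for codec, hex_bytes in encodings(euro).items():
        print(f"{codec:10} {hex_bytes}")
```

The code point stays the same in every case; only the bytes written to disk differ.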
Windows applications that don't use Unicode save text files using one of the Windows code pages, often called "ANSI" code pages. TextPad can display and save characters from only a single code page at a time. So, for instance, all the characters in a document have to fit into, say, the Windows 1252 character encoding scheme, which comprises around 224 'real' characters.
Once you have a text editor that supports Unicode text, you then need to start using fonts that include all the required glyphs. After that you may also need to support languages that are written from right to left. Not a simple thing.
Hope this helps.
[ I'd already written this... ]
Unicode is a 21-bit character set: each character has a numerical value that can be represented in 21 bits.
There are several ways in which a stream of 21-bit values can be stored. UTF-8 is one of them, using from one to four bytes for each Unicode value. ASCII, a 7-bit character set, is a subset of Unicode, and in UTF-8 every ASCII value is encoded in a single byte. Other Unicode values are encoded in two, three or four bytes.
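As a rough illustration of the one-to-four-byte rule, here is a small Python sketch. The sample characters and the helper name `utf8_length` are my own choices, not anything from TextPad.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 uses to store a single character."""
    return len(ch.encode("utf-8"))


# One sample from each length class: ASCII letter, accented Latin letter,
# the euro sign, and a character outside the Basic Multilingual Plane.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]

if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")
    # The highest Unicode code point, U+10FFFF, needs 21 bits.
    print((0x10FFFF).bit_length())
```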
Google will help you find many descriptions of UTF-8.
@SteveH Thank you for the explanation. From what you described, I think the key to the issue is whether TextPad saves files in Unicode or not. Am I correct to interpret it this way: although the TextPad Save dialog allows one to save a text file as ASCII, ANSI or UTF-8, it does not support UTF-8 natively and still needs to translate the characters into code pages?
It would be nice for TextPad to have full support, including different text orientations, but supporting UTF-8 would be a good starting point. Notepad, despite how primitive it is, does support UTF-8 natively. It doesn't handle different text orientations, and it shows blocks for glyphs missing from the font. I guess users can fully understand it if the font chosen to display the text does not contain the full Unicode set. On the other hand, I was confused when I picked the right font (Arial Unicode) but was still unable to display the text correctly.
@ben_josephs I think you had already written this, but for some reason I was having some difficulty interpreting it.
aceone wrote: Am I correct to interpret it as despite the fact that TextPad Save dialog allows one to save a text file as ASCII or ANSI or UTF-8, it does not support UTF-8 natively and still needs to translate the characters into code pages?
As Helios say themselves, I think it boils down to...
"it is only possible to edit, without data loss, files containing characters from the implied code page"
One of the good web resources on Unicode is alanwood.net, which describes TextPad as follows:
"It can be used to edit UTF-16 and UTF-8 files, but only with ... the characters for one codepage, not for the whole of Unicode. TextPad only supports a single font, so for multi-script Web pages a large font such as Arial Unicode MS is needed in order to show all of the characters."
Hope this helps.