Unmappable input sequence.
Posted: Sun May 23, 2010 11:19 am
by Mike Olds
Hello,
Attempting to replace '&mdash;' with the m-dash, I get a confusing result. The search is made and the changes are listed, but each is followed by a note in red: Unmappable input sequence.
There is no option to abort.
I took a chance and closed the program. There was no option to delete backup files.
So. How DO I do this search and replace?
Time: 2010-May-23 04:12:06
Search Pattern: &mdash;
Replacement Format: —
Character Encoding: ISO-8859-1
Root folder: F:\All Webs\themozone\dhamma-vinaya\ati
Include Filter: *.txt *.htm *.html *.shtml
Exclude Filter: *.gif *.jpg *.jepg *.png *.ai *.pdf *.psd
Regular Expression: false
Match Words: false
Match Case: true
"." matches null characters: false
"." matches end of line characters: false
Literal Replacement Format: false
Search Subfolders: true
}
Example:
../dhamma-vinaya/ati/an/02_twos/an02.018.than.ati.htm: Character conversion: Unmappable input sequence.
Posted: Sun May 23, 2010 1:17 pm
by ben_josephs
Perhaps it's complaining about a character code error.
The log indicates that you specified ISO-8859-1 (Latin-1) encoding. There is no em dash in ISO-8859-1. There is, however, an em dash in Windows-1252 (WinLatin-1): it's the character with code 0x97. (Windows-1252 differs from ISO-8859-1 in the character codes 0x80 to 0x9F: in ISO-8859-1 they are control codes; in Windows-1252 they are printable characters.)
What happens if you tell WildEdit that the encoding is Windows-1252?
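The difference at byte 0x97 can be checked with a few lines of Python. This is only a sketch using Python's built-in `latin-1` and `cp1252` codecs; it does not show what WildEdit does internally, and the variable names are made up for illustration.

```python
# The same byte, 0x97, decoded under the two encodings discussed above.
raw = b"\x97"

as_latin1 = raw.decode("latin-1")  # a C1 control character, U+0097
as_cp1252 = raw.decode("cp1252")   # the em dash, U+2014

print(hex(ord(as_latin1)))
print(hex(ord(as_cp1252)))

# Going the other way: ISO-8859-1 has no byte at all for the em dash,
# so trying to encode it fails.
try:
    "\u2014".encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)
```

A tool that is told the files are ISO-8859-1 and then asked to write an em dash has nowhere to put it, which is presumably the situation behind the error message in the log.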
Posted: Mon May 24, 2010 10:49 am
by Mike Olds
Good morning, Ben,
I did the following:
In TextPad I did a 'find-in-files' for '&mdash;'. TextPad had no trouble finding such files.
Found a file with the '&mdash;'; copied it to a separate directory (I was afraid of what might happen).
Ran WildEdit as follows (no change of Character Encoding):
=== BEGIN REPLACE COMMAND ===
{
Time: 2010-May-24 03:40:21
Search Pattern: &mdash;
Replacement Format: —
Character Encoding: ISO-8859-1
Root folder: F:\All Webs\themozone\we_test
Include Filter: *.txt *.htm *.html *.shtml
Exclude Filter: *.gif *.jpg *.jepg *.png *.ai *.pdf *.psd
Regular Expression: false
Match Words: false
Match Case: true
"." matches null characters: false
"." matches end of line characters: false
Literal Replacement Format: false
Search Subfolders: true
}
F:/All Webs/themozone/we_test/mn.43.l354.mdash.htm: Character conversion: Unmappable input sequence.
Number of files searched: 2
Number of files modified: 1
Total changes made: 20
=== END REPLACE COMMAND ===
Then ran WildEdit as follows (changing Character Encoding). As you can see, this worked just fine.
=== BEGIN REPLACE COMMAND ===
{
Time: 2010-May-24 03:41:53
Search Pattern: &mdash;
Replacement Format: —
Character Encoding: windows-1250
Root folder: F:\All Webs\themozone\we_test
Include Filter: *.txt *.htm *.html *.shtml
Exclude Filter: *.gif *.jpg *.jepg *.png *.ai *.pdf *.psd
Regular Expression: false
Match Words: false
Match Case: true
"." matches null characters: false
"." matches end of line characters: false
Literal Replacement Format: false
Search Subfolders: true
}
F:/All Webs/themozone/we_test/mn.43.l354.mdash.htm: 20 replacements made
Number of files searched: 2
Number of files modified: 1
Total changes made: 20
=== END REPLACE COMMAND ===
Doubts remain: Does this do anything with the information about the file that is stored with the file?
I can see it did not change the Character Encoding Meta Data, but was anything hidden changed? Is this safe to do across my file sets, which are all in ISO-8859-1?
Posted: Mon May 24, 2010 3:07 pm
by ben_josephs
What "character encoding meta data" are you referring to? There is nothing in a plain text file that distinguishes iso-8859-1 from windows-1252; only your knowledge of what's in it can do that.
Whether interpreting your files as encoded in windows-1252 is safe depends on how you are using them.
If they will be staying on Windows machines, it's probably OK.
If they will be transferred to a non-Windows machine, it's probably not OK.
And if they're web pages, and if they contain an element similar to
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
or contain no such element, then the Prince of Darkness is entitled to appear in the sitting room of anyone who views them through a browser (and, if it's an HTML 5 page, he probably will do). If you use windows-1252 you should change the meta element to
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
Why did you choose windows-1250 (Central European) instead of windows-1252 (Western)?
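For what it's worth, the two code pages agree on the em dash but not on everything else, which may be why windows-1250 appeared to work. The sketch below uses Python's built-in `cp1250` and `cp1252` codecs; the choice of byte 0xF1 as a comparison is just an illustrative example, not something taken from the files in question.

```python
# Both Windows code pages put the em dash at 0x97 ...
raw_dash = b"\x97"
same_dash = raw_dash.decode("cp1250") == raw_dash.decode("cp1252")

# ... but they disagree on other bytes, for example 0xF1.
raw_f1 = b"\xf1"
western = raw_f1.decode("cp1252")  # U+00F1, n with tilde
central = raw_f1.decode("cp1250")  # U+0144, n with acute

print(same_dash)
print(hex(ord(western)), hex(ord(central)))
```

So a file that contains only ASCII plus the em dash would come through either setting unchanged, while a file with accented Western letters could be misread under windows-1250.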
Posted: Mon May 24, 2010 5:18 pm
by Mike Olds
Hello Ben,
Thank you for your attention.
The 'meta-data' I was referring to was the meta-data in the header of an html page. All the files I am referring to are html pages.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Style-Type" content="text/css">
<link rel="stylesheet" type="text/css" href="../../../../admin/styles/common.css" />
You ask:
Why did you choose windows-1250 (Central European) instead of windows-1252 (Western)?
? Ignorance?
Let me start from the beginning for a second here:
What I have here is a bunch of html files with the so-called 'meta data' as per the above example. I wasn't speaking about something that isn't visible in the file, such as that of which I suspect you speak. Can we say that in English?
The default encoding I use in TextPad is ANSI.
These files have the '&mdash;' named character, which is what I want to change to the actual character '—' ...?...glyph? (Please correct this word usage and tolerate it as meaning the actual character as it appears).
So IN WILD-EDIT, where it indicates 'Encoding' I selected 'windows-1250' just because it was the first and looked familiar and then because it worked.
Then, opening the files in Opera or I.E.5 (I'm W2k) the 'glyph' character for the m-dash shows up fine.
I do not have a Linux machine at the moment, nor do I have a Mac to check what happens there.
Are you saying that wherever I am using iso-8859-1 and have simply put in the glyph character it is unlikely to come out properly in Mac/Linux? That I had better just stick with, or convert all files to, '&mdash;'?
Search and replace in TextPad works fine for this problem which is one thing contributing to my confusion here about what is going on.
Posted: Mon May 24, 2010 8:56 pm
by ben_josephs
As I said earlier, there is no em dash in iso-8859-1. In windows-1252, em dash has the code 0x97. In iso-8859-1, 0x97 is the code for the obscure control character EPA (End of Guarded Area). I have no idea what it's for, but it is not allowed in an HTML document.
So if an HTML document states that it's encoded in iso-8859-1 but contains a windows-1252 em dash, it's not well-formed. Most browsers will, however, interpret it as the author presumably intended. But this cannot be guaranteed.
If you want to use a literal em dash, you should serve up your document as windows-1252 or, if you'd like to show your international credentials, as utf-8.
Alternatively, you can stick to iso-8859-1 and leave each &mdash; as it is and convert each em dash to &mdash;. (You can also use &#x2014; as the Unicode code for em dash is U+2014.)
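As a rough illustration of that last option, here is how a literal em dash could be turned into a named or numeric character reference in Python. The sample string is invented; only the standard `html` module and built-in string methods are used. Note that Python's `xmlcharrefreplace` handler emits the decimal form, which is equivalent to the hexadecimal one.

```python
import html

text = "form \u2014 emptiness"  # invented sample containing a literal em dash

# Named reference: a plain string replacement.
named = text.replace("\u2014", "&mdash;")

# Numeric reference: let the ASCII encoder substitute anything non-ASCII.
numeric = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(named)
print(numeric)

# Both results are pure ASCII, so they are also valid ISO-8859-1,
# and a browser (or html.unescape) turns them back into the em dash.
named.encode("latin-1")
restored = html.unescape(named)
print(restored == text)
```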
Posted: Mon May 24, 2010 10:01 pm
by Mike Olds
OK! Thanks Ben. It will be the last option for the time being.
Nevermind this
Posted: Tue May 25, 2010 11:48 am
by Mike Olds
Hello,
This will not work for me at this time as it will require the conversion of a set of custom characters to ASCII, which I am not yet prepared to do.
So I have a fully functioning website that will appear properly only on I.E. 5-6.
It would be helpful to know if the below was the correct procedure for a normal conversion.
==============================
Following on this discussion, I would like to change all my files to utf-8 encoding.
I plan to:
1. Do a search and replace across all the files to
find:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
replace with:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
2. Change configuration in TextPad:
Configure>Preferences>Document Classes>
Default
[x] Write Unicode and UTF-8 BOM
Default Encoding [x] UTF-8
Create New File as [x] PC
Apply these settings to all document classes.
Check.
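For comparison, the same plan can be sketched outside TextPad in a few lines of Python. This is a sketch under assumptions, not a tested procedure for this site: it assumes the existing files are really Windows-1252 bytes (which is what a literal em dash at 0x97 implies), the function name `convert_to_utf8` is hypothetical, and it writes UTF-8 without a BOM, which differs from the TextPad setting in step 2.

```python
from pathlib import Path

OLD_CHARSET = "charset=iso-8859-1"
NEW_CHARSET = "charset=utf-8"


def convert_to_utf8(path, source_encoding="cp1252"):
    """Re-encode one HTML file as UTF-8 and update its charset declaration.

    Hypothetical helper: source_encoding is an assumption about the file.
    Bytes that are undefined in that encoding raise UnicodeDecodeError,
    which is safer than silently writing a damaged file.
    """
    path = Path(path)
    # Work on raw bytes so line endings are left exactly as they were.
    text = path.read_bytes().decode(source_encoding)
    text = text.replace(OLD_CHARSET, NEW_CHARSET)
    path.write_bytes(text.encode("utf-8"))
    return text
```

It would be prudent to run anything like this on a copy of the files first, as was done with the we_test folder earlier in the thread.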