Unmappable input sequence.

Posted: Sun May 23, 2010 11:19 am
by Mike Olds
Hello,

Attempting to replace '&mdash;' with the m-dash, I get a confusing result. The search runs and the changes are listed, but each is followed by a note in red: Unmappable input sequence.

There is no option to abort.

I took a chance and closed the program. There was no option to delete backup files.

So. How DO I do this search and replace?

Time: 2010-May-23 04:12:06
Search Pattern: &mdash;
Replacement Format: —
Character Encoding: ISO-8859-1
Root folder: F:\All Webs\themozone\dhamma-vinaya\ati
Include Filter: *.txt *.htm *.html *.shtml
Exclude Filter: *.gif *.jpg *.jepg *.png *.ai *.pdf *.psd
Regular Expression: false
Match Words: false
Match Case: true
"." matches null characters: false
"." matches end of line characters: false
Literal Replacement Format: false
Search Subfolders: true
}

Example:


../dhamma-vinaya/ati/an/02_twos/an02.018.than.ati.htm: Character conversion: Unmappable input sequence.

Posted: Sun May 23, 2010 1:17 pm
by ben_josephs
Perhaps it's complaining about a character code error.

The log indicates that you specified ISO-8859-1 (Latin-1) encoding. There is no em dash in ISO-8859-1. There is, however, an em dash in Windows-1252 (WinLatin-1): it's the character with code 0x97. (Windows-1252 differs from ISO-8859-1 in the character codes 0x80 to 0x9F: in ISO-8859-1 they are control codes; in Windows-1252 they are printable characters.)
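
As a rough illustration of the difference, here is a minimal Python sketch. It uses Python's own codec tables ("cp1252" and "latin-1"), not WildEdit, so it only shows what the two encodings say about byte 0x97; that WildEdit's message comes from the same kind of failed conversion is an assumption.

```python
# Byte 0x97 means different things in the two encodings.
raw = b"\x97"

as_cp1252 = raw.decode("cp1252")    # Windows-1252: EM DASH, U+2014
as_latin1 = raw.decode("latin-1")   # ISO-8859-1: a C1 control character, U+0097

print(hex(ord(as_cp1252)))  # 0x2014
print(hex(ord(as_latin1)))  # 0x97

# Going the other way, an em dash cannot be encoded as ISO-8859-1 at all.
try:
    "\u2014".encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)  # False
```

The last step is the "unmappable" case: the replacement character has no code in the chosen encoding.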

What happens if you tell WildEdit that the encoding is Windows-1252?

Posted: Mon May 24, 2010 10:49 am
by Mike Olds
Good morning, Ben,

I did the following:

In TextPad I did a 'find-in-files' for '&mdash;'. TextPad had no trouble finding such files.

Found a file with the '&mdash;'; copied it to a separate directory (I was afraid of what might happen).

Ran WildEdit as follows (no change of Character Encoding):

=== BEGIN REPLACE COMMAND ===
{
Time: 2010-May-24 03:40:21
Search Pattern: &mdash;
Replacement Format: —
Character Encoding: ISO-8859-1
Root folder: F:\All Webs\themozone\we_test
Include Filter: *.txt *.htm *.html *.shtml
Exclude Filter: *.gif *.jpg *.jepg *.png *.ai *.pdf *.psd
Regular Expression: false
Match Words: false
Match Case: true
"." matches null characters: false
"." matches end of line characters: false
Literal Replacement Format: false
Search Subfolders: true
}
F:/All Webs/themozone/we_test/mn.43.l354.mdash.htm: Character conversion: Unmappable input sequence.
Number of files searched: 2
Number of files modified: 1
Total changes made: 20
=== END REPLACE COMMAND ===

Then ran WildEdit as follows (changing the Character Encoding). As you can see, this worked just fine.

=== BEGIN REPLACE COMMAND ===
{
Time: 2010-May-24 03:41:53
Search Pattern: &mdash;
Replacement Format: —
Character Encoding: windows-1250
Root folder: F:\All Webs\themozone\we_test
Include Filter: *.txt *.htm *.html *.shtml
Exclude Filter: *.gif *.jpg *.jepg *.png *.ai *.pdf *.psd
Regular Expression: false
Match Words: false
Match Case: true
"." matches null characters: false
"." matches end of line characters: false
Literal Replacement Format: false
Search Subfolders: true
}
F:/All Webs/themozone/we_test/mn.43.l354.mdash.htm: 20 replacements made
Number of files searched: 2
Number of files modified: 1
Total changes made: 20
=== END REPLACE COMMAND ===

Doubts remain: Does this do anything with the information about the file that is stored with the file?

I can see it did not change the character encoding meta data, but was anything hidden changed? Is this safe to do across my file sets, which are all in ISO-8859-1?

Posted: Mon May 24, 2010 3:07 pm
by ben_josephs
What "character encoding meta data" are you referring to? There is nothing in a plain text file that distinguishes iso-8859-1 from windows-1252; only your knowledge of what's in it can do that.

Whether interpreting your files as encoded in windows-1252 is safe depends on how you are using them.

If they will be staying on Windows machines, it's probably OK.

If they will be transferred to a non-Windows machine, it's probably not OK.

And if they're web pages, and if they contain an element similar to
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
or contain no such element, then the Prince of Darkness is entitled to appear in the sitting room of anyone who views them through a browser (and, if it's an HTML 5 page, he probably will do). If you use windows-1252 you should change the meta element to
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

Why did you choose windows-1250 (Central European) instead of windows-1252 (Western)?

Posted: Mon May 24, 2010 5:18 pm
by Mike Olds
Hello Ben,

Thank you for your attention.

The 'meta-data' I was referring to was the meta-data in the header of an html page. All the files I am referring to are html pages.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Style-Type" content="text/css">
<link rel="stylesheet" type="text/css" href="../../../../admin/styles/common.css" />

You ask:

Why did you choose windows-1250 (Central European) instead of windows-1252 (Western)?

? Ignorance?

Let me start from the beginning for a second here:

What I have here is a bunch of html files with the so-called 'meta data' as per the above example. I wasn't speaking about something that isn't visible in the file, such as that of which I suspect you speak. Can we say that in English?

The default encoding I use in TextPad is ANSI.

These files have the '&mdash;' named character which is what I want to change to the actual character '—' ...?...glyph? (Please correct this word usage and tolerate it as meaning the actual character as it appears).

So IN WILD-EDIT, where it indicates 'Encoding' I selected 'windows-1250' just because it was the first and looked familiar and then because it worked.

Then, opening the files in Opera or I.E.5 (I'm W2k) the 'glyph' character for the m-dash shows up fine.

I do not have a Linux machine at the moment, nor do I have a Mac to check what happens there.

Are you saying that wherever I am using iso-8859-1 and have simply put in the glyph character it is unlikely to come out properly in Mac/Linux? That I had better just stick with, or convert all files to '&mdash;'?

Search and replace in TextPad works fine for this problem which is one thing contributing to my confusion here about what is going on.

Posted: Mon May 24, 2010 8:56 pm
by ben_josephs
As I said earlier, there is no em dash in iso-8859-1. In windows-1252, em dash has the code 0x97. In iso-8859-1, 0x97 is the code for the obscure control character EPA (End of Guarded Area). I have no idea what it's for, but it is not allowed in an HTML document.

So if an HTML document states that it's encoded in iso-8859-1 but contains a windows-1252 em dash, it's not well-formed. Most browsers will, however, interpret it as the author presumably intended. But this cannot be guaranteed.
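
A quick way to spot such pages is to look for bytes in the range 0x80 to 0x9F. The sketch below is only an illustration: the function name find_c1_bytes and the sample bytes are made up, and it assumes the file really is in a single-byte encoding (in a UTF-8 file the same byte values occur legitimately inside multi-byte sequences).

```python
def find_c1_bytes(data: bytes) -> list:
    """Return the offsets of bytes in the range 0x80-0x9F.

    In ISO-8859-1 these are control codes, so any hit in a page that
    declares iso-8859-1 is probably a Windows-1252 character.
    """
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Hypothetical page fragment containing a Windows-1252 em dash (0x97).
sample = b"<p>one\x97two</p>"
print(find_c1_bytes(sample))  # [6]
```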

If you want to use a literal em dash, you should serve up your document as windows-1252 or, if you'd like to show your international credentials, as utf-8.

Alternatively, you can stick to iso-8859-1 and leave each &mdash; as it is and convert each em dash to &mdash;. (You can also use &#x2014; as the Unicode code for em dash is U+2014.)
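
For what it's worth, both forms of that last option can be sketched in a few lines of Python. The sample text is invented, and this works on a string in memory rather than on your files. The second variant relies on Python's "xmlcharrefreplace" error handler, which writes decimal references, so it produces &#8212; (the same character as &#x2014;).

```python
# Hypothetical text containing a literal em dash (U+2014).
text = "form\u2014and emptiness"

# Option 1: replace the em dash with the named entity.
named = text.replace("\u2014", "&mdash;")

# Option 2: encode as ISO-8859-1 and let the codec turn anything
# it cannot represent into a numeric character reference.
numeric = text.encode("latin-1", errors="xmlcharrefreplace")

print(named)    # form&mdash;and emptiness
print(numeric)  # b'form&#8212;and emptiness'
```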

Posted: Mon May 24, 2010 10:01 pm
by Mike Olds
OK! Thanks Ben. It will be the last option for the time being.

Nevermind this

Posted: Tue May 25, 2010 11:48 am
by Mike Olds
Hello,

This will not work for me at this time as it will require the conversion of a set of custom characters to ASCII, which I am not yet prepared to do.

So I have a fully functioning website that will appear properly only on I.E. 5-6.

It would be helpful to know if the below was the correct procedure for a normal conversion.

==============================

Following on this discussion, I would like to change all my files to utf-8 encoding.

I plan to:

1. Do a search and replace across all the files to
find:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
replace with:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

2. Change configuration in TextPad:
Configure>Preferences>Document Classes>
Default
[x] Write Unicode and UTF-8 BOM
Default Encoding [x] UTF-8
Create New File as [x] PC

Apply these settings to all document classes.

Check.
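
For comparison, a conversion of this kind could be sketched in Python as below. This is only a sketch under assumptions: it treats the existing bytes as Windows-1252 (as suggested earlier in the thread), the name to_utf8 and the sample page are made up, and it works on bytes in memory rather than on the files themselves. Step 1 alone changes the declaration but not the bytes, so the sketch also re-encodes the content.

```python
OLD_CHARSET = "charset=iso-8859-1"
NEW_CHARSET = "charset=utf-8"


def to_utf8(data: bytes) -> bytes:
    """Decode as Windows-1252, update the charset declaration, re-encode as UTF-8.

    Assumption: the input really is Windows-1252. Python's cp1252 codec
    raises UnicodeDecodeError on the five byte values that encoding
    leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
    """
    text = data.decode("cp1252")
    text = text.replace(OLD_CHARSET, NEW_CHARSET)
    return text.encode("utf-8")


# Hypothetical page: declares iso-8859-1 but contains a 0x97 em dash.
page = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\n'
    b"<p>a\x97b</p>"
)
converted = to_utf8(page)
print(converted)
```

After conversion the em dash is stored as the three bytes E2 80 94, which is its UTF-8 form.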