named character entities

Post by **WayneCa** » Wed Jun 10, 2020 4:45 pm

I edit web pages. In them, I use named entities like &nbsp;. These are usually highlighted in dark green. However, today I used &mdash; and &ndash; for the first time. Instead of being highlighted in dark green, they were left white. I take this to mean that TextPad doesn't recognize them. What file do I edit to add these names to the list? I looked at the html.syn file and it doesn't contain them. I looked at the colors for the HTML document class, and that color is not one of the colors I set when I customized the settings.

Post by **AmigoJack** » Wed Jun 10, 2020 8:07 pm

WayneCa wrote:I looked at the html.syn file and it doesn't contain them.

But it contains the following 3 interesting lines:

Code: Select all

HTML=1
...
CharStart = &
CharEnd = ;

That means that the HTML entities are stored in TextPad itself and don't need to be defined elsewhere. As per https://en.wikipedia.org/wiki/List_of_X ... es_in_HTML TextPad might only support pre-4.0 entities.

Why would anyone need entities nowadays anyway? Unicode has solved this for decades already - only the 4 XML entities are needed so that angle brackets, ampersands and quotation marks do not interfere with the syntax - anything else can simply be used directly, in whatever text encoding the file is saved in.

Post by **WayneCa** » Wed Jun 10, 2020 8:53 pm

AmigoJack wrote:That means that the HTML entities are stored in TextPad itself and don't need to be defined elsewhere. As per https://en.wikipedia.org/wiki/List_of_X ... es_in_HTML TextPad might only support pre-4.0 entities.

I had not looked at Wikipedia. It had not occurred to me that TextPad might only support the 8-bit entities.

AmigoJack wrote:Why would anyone need entities nowadays anyway? Unicode has solved this for decades already - only the 4 XML entities are needed so that angle brackets, ampersands and quotation marks do not interfere with the syntax - anything else can simply be used directly, in whatever text encoding the file is saved in.

I know that the browsers display the characters without an issue, but apparently the vnu validator doesn't agree. According to it every character that I have replaced with a named entity was seen by it as "malformed". Replacing it with an entity corrected the error. Since vnu is the validator designed by the W3C, I assumed it was current. I use the standalone version as an external tool in TextPad to validate the pages I am editing. The only things that don't validate are the characters that I have to replace with a named entity.

Thank you for replying as quickly as you did. It helps me get on with other things more quickly.

Post by **AmigoJack** » Wed Jun 10, 2020 10:07 pm

WayneCa wrote:the vnu validator

Care to link?

WayneCa wrote:According to it every character that I have replaced with a named entity was seen by it as "malformed". Replacing it with an entity corrected the error. ... I use the standalone version as an external tool in TextPad to validate the pages I am editing. The only things that don't validate are the characters that I have to replace with a named entity.

Sounds a bit contradicting. Most likely the error is not in any validator, but in how you feed it: if it's direct I/O then take care of which encodings are used and expected. It's rarely clear who assumes/provides UTF-8, 8859 or even 850. Otherwise save your file with a UTF-8 BOM and open it directly in "the tool".

Just as with https://validator.w3.org/ you'd better upload a file to make sure copy/paste text encodings don't interfere.

Post by **WayneCa** » Wed Jun 10, 2020 11:46 pm

AmigoJack wrote:
WayneCa wrote:the vnu validator
Care to link?

https://validator.github.io/validator/ This is where I downloaded it. It runs standalone from the command line or can be run from TextPad as a Tool.

AmigoJack wrote:Sounds a bit contradicting. Most likely the error is not in any validator, but in how you feed it: if it's direct I/O then take care of which encodings are used and expected. It's rarely clear who assumes/provides UTF-8, 8859 or even 850. Otherwise save your file with a UTF-8 BOM and open it directly in "the tool".

Just as with https://validator.w3.org/ you'd better upload a file to make sure copy/paste text encodings don't interfere.

How is it contradicting? I put the em-dash character back into the document and ran the validator just so I could show you the output. In TextPad I have it set under the Tool category with the following options:

Parameters: --html $File
Initial Folder: $FileDir
Capture Output and Sound alert when completed are checked, everything else is unchecked.
Regular expression to match output is the default: ^([^(]+)$(\d+)$:
File is blank, Line is 1, Column is 1

To invoke it, I have the document I am checking open and in the foreground in TextPad, then type <CTRL>-1 (control-one). It does the rest automatically. (I put the ~ in the file path myself. The validator supplied the entire path.)

Code: Select all

"file:/C:/Users/~/Source/errors.htm":120.30-120.30: error: Malformed byte sequence: “97”.

Tool completed with exit code 1

If I'm missing something, please clue me in. As far as I can tell, I am doing nothing to cause bad results. The only errors I get once all the attributes and font tags have been converted to CSS is this type of error when there is a character it doesn't like. Using a named entity fixes it. The text within the quotes (97) is in hex, as I see "e9" for the é character. The value changes according to what character it is. 97 is the emdash character.

Edit: I went to the online validator and pasted the document into the validate by direct input pane. The validator didn't have a problem with the emdash being there. I guess it's just the standalone vnu validator that has the issue.

Post by **AmigoJack** » Thu Jun 11, 2020 12:04 am

WayneCa wrote:How is it contradicting?

Re-read what you wrote - one of your sentences is missing an important word that would invert its meaning. As it stands now you do the same action twice.

WayneCa wrote:I put the em-dash character back into the document

Save the document in UTF-8 encoding with a BOM before running the tool.

WayneCa wrote:this type of error when there is a character it doesn't like

No, the error clearly says "byte sequence", not "character". Those things happen when the consumer (the tool) expects UTF-8 and you're feeding 8859 - didn't you stumble upon all the details I wrote? It's not enough to let HTML declare the encoding - the actual file must be encoded the same way.
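The mismatch can be reproduced outside the validator. The following is a minimal Python sketch, not taken from v.Nu or TextPad; it assumes that TextPad's "Default" encoding on this machine is Windows-1252 (code page 1252), which is an assumption and may differ on other systems. Under that assumption the em dash is stored as the single byte 0x97, which matches the "97" in the error message, and that byte on its own is not valid UTF-8.

```python
# Sketch only: assumes the legacy file encoding is Windows-1252 ("cp1252").
EM_DASH = "\u2014"  # the em dash character


def encode_em_dash(encoding: str) -> bytes:
    """Return the bytes an editor would write for an em dash in this encoding."""
    return EM_DASH.encode(encoding)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


legacy = encode_em_dash("cp1252")  # one byte: 0x97
modern = encode_em_dash("utf-8")   # three bytes: 0xE2 0x80 0x94

print(legacy.hex(), is_valid_utf8(legacy))
print(modern.hex(), is_valid_utf8(modern))
```

A consumer that expects UTF-8 sees the lone 0x97 byte as a malformed sequence, whatever the HTML itself declares.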

WayneCa wrote:Edit: I went to the online validator and pasted the document into the validate by direct input pane. The validator didn't have a problem with the emdash being there. I guess it's just the standalone vnu validator that has the issue.

As I wrote before: the tool might also not be the problem, but how you save your file. As I wrote before: upload your actual file instead of copy/pasting text - if results differ then it's obvious you're not saving your file correctly to begin with.

Also consider using the parameter nu.validator.client.charset of v.Nu to provide the encoding of the document you feed it with, as per the manual.

Post by **WayneCa** » Thu Jun 11, 2020 1:16 pm

AmigoJack wrote:Re-read what you wrote - one of your sentences is missing an important word that would invert its meaning. As it stands now you do the same action twice.

I'm not certain what you mean here, but I will restate what I said. Hopefully that will remove any contradiction you see.

When I run the validator on a document that has, say, the è character in it, the validator reports a malformed byte sequence. If I replace the è with the corresponding named entity, the error is resolved.

AmigoJack wrote:Save the document in UTF-8 encoding with a BOM before running the tool.

I have been saving all my text documents the same way for over 30 years. Plain text. For the purposes of this discussion, my TextPad save parameters are:

Line endings: PC
Encoding: Default
UNICODE BOM: unchecked

This is the first time I have ever been told that I may be saving the files wrong.

AmigoJack wrote:No, the error clearly says "byte sequence", not "character". Those things happen when the consumer (the tool) expects UTF-8 and you're feeding 8859 - didn't you stumble upon all the details I wrote? It's not enough to let HTML declare the encoding - the actual file must be encoded the same way.

I understand that the error says byte sequence. I refer to character because it is those characters that are causing the error. Replacing them with a named entity has been correcting the error.

AmigoJack wrote:As I wrote before: the tool might also not be the problem, but how you save your file. As I wrote before: upload your actual file instead of copy/pasting text - if results differ then it's obvious you're not saving your file correctly to begin with.

I will try that. I used copy/paste as it was more expedient at the time.

AmigoJack wrote:Also consider using the parameter nu.validator.client.charset of v.Nu to provide the encoding of the document you feed it with, as per the manual.

I will look into that too. Many things in the manual were not understandable to me as to whether or not they related to my use of it in TextPad as an external tool.

Edit: I used the file upload to validate the page. To my utter amazement the validator kept telling me that the <!DOCTYPE html> statement contained characters that it couldn't interpret as UTF-8! I resaved the document with UTF-8 encoding and UNICODE BOM checked and it validated. Well, after I learned that I must include a <META charset="utf-8"/> statement in the document as well. I had not run across this requirement before either.

Also, I validated the page using the vnu I have installed as a tool, but without updating the parameter. It validated, even with the character still there and not replacing it with the named entity. I guess this means I need to go back and resave all of the documents using UTF-8 encoding with UNICODE BOM checked, and adding the META statement to each one.

Thanks for the help, and for your patience with me learning to wrap my head around this. It is very much appreciated.
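For resaving a larger batch of documents, a small script may be quicker than doing it by hand in the editor. This is only a sketch under stated assumptions: the old files are assumed to be Windows-1252 (the `source_encoding` default below is a guess, not something confirmed in this thread), and the function names and the `*.htm` pattern are made up for illustration. Run it on a backup copy first.

```python
import codecs
from pathlib import Path


def to_utf8_with_bom(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode raw file bytes as UTF-8 with a BOM.

    source_encoding is an assumption about how the file was saved before.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw  # already UTF-8 with a BOM, leave untouched
    try:
        text = raw.decode("utf-8")  # plain ASCII or BOM-less UTF-8
    except UnicodeDecodeError:
        text = raw.decode(source_encoding)  # fall back to the legacy encoding
    return text.encode("utf-8-sig")  # "utf-8-sig" prepends the BOM


def convert_folder(folder: str, pattern: str = "*.htm") -> None:
    """Rewrite every matching file in the folder in place."""
    for path in Path(folder).glob(pattern):
        path.write_bytes(to_utf8_with_bom(path.read_bytes()))
```

Calling `convert_folder` on the same folder a second time changes nothing, because files that already start with the BOM are returned unchanged.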

Post by **AmigoJack** » Thu Jun 11, 2020 9:09 pm

I'm happy if you can now say to yourself "heck, I never thought about that context - all the time I was fixing symptoms instead of finding the actual culprit - good bye entities".

Text files can have different encodings, which should now be obvious to you. In the internet context (i.e. your internet browser or any *ML validator) the consuming software is most likely able to recognize Unicode encodings if the file has a BOM in it. So if you "just" save your files in UTF-8 encoding with a BOM then chances are you don't even need the <META> HTML tag anymore (in the worst case it's redundant and in the best case it confirms the recognized encoding).

Try it once and it should also validate.
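How a consumer might decide on an encoding can be sketched roughly as follows. This is a heavily simplified illustration in Python, not the actual detection logic of any browser or of v.Nu: it only checks for a UTF-8 BOM first and then looks for a charset declaration in a META tag near the start of the file. The function name and the 1024-byte window are choices made for this sketch.

```python
import codecs
import re
from typing import Optional

# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def sniff_encoding(data: bytes) -> Optional[str]:
    """Guess the encoding of an HTML file: BOM first, then a META declaration."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"  # the BOM wins; any META tag is then only a confirmation
    match = META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None  # nothing declared; the consumer has to guess or use a default
```

With a BOM present the META declaration is never consulted in this sketch, which is the "redundant at worst, confirming at best" situation described above.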

Post by **WayneCa** » Thu Jun 11, 2020 10:32 pm

AmigoJack wrote:I'm happy if you can now say to yourself "heck, I never thought about that context - all the time I was fixing symptoms instead of finding the actual culprit - good bye entities".

Yes, yes I can, and I did! I changed all of the differing entities back to their actual characters in every file that contained them. Most of the occurrences were no-break spaces. It's nice to not have to fall back on them anymore, but it's also nice there's a complete list of all HTML entities I can refer to when I need to replace one. It's not easy trying to find them in Character Map.

AmigoJack wrote: <snip>

Try it once and it should also validate.

I will do that. It's just too bad I didn't see this before I added that META statement to about 30 documents.

Post by **MudGuard** » Fri Jun 12, 2020 9:06 am

AmigoJack wrote:Why would anyone need entities nowadays anyway?

If you have characters that are invisible (zero-width joiner, zero-width non-joiner, soft hyphen, non-breaking space, left-to-right mark, right-to-left mark and so on), it is sometimes easier to edit with the entities, as these are visible.
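As an illustration of that workflow, invisible characters can be swapped for their named entities before editing. This is a minimal Python sketch; the mapping is a hand-picked selection for the characters listed above and the function name is made up for this example.

```python
import html

# Hand-picked mapping of invisible characters to their HTML named entities.
INVISIBLE_TO_ENTITY = {
    "\u00a0": "&nbsp;",  # no-break space
    "\u00ad": "&shy;",   # soft hyphen
    "\u200c": "&zwnj;",  # zero-width non-joiner
    "\u200d": "&zwj;",   # zero-width joiner
    "\u200e": "&lrm;",   # left-to-right mark
    "\u200f": "&rlm;",   # right-to-left mark
}


def reveal_invisibles(text: str) -> str:
    """Replace invisible characters with visible named entities."""
    return text.translate(str.maketrans(INVISIBLE_TO_ENTITY))


sample = "10\u00a0km"
visible = reveal_invisibles(sample)
print(visible)                          # the no-break space is now spelled out
print(html.unescape(visible) == sample)  # the entity decodes back to the character
```

All other characters, including dashes and accented letters, pass through unchanged.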

Post by **pbaumann** » Fri Mar 19, 2021 9:03 am

Hi all,
I have the same opinion as MudGuard. I'm using TextPad to edit files for a huge HTML documentation set, written in raw HTML, with some tools generating parts of the pages. This documentation is still "old-fashioned": the source code is human-friendly and at the same time typographically correct (a minus sign is not a replacement for ndash or mdash). The documentation set contains 10,000+ files, and switching the text encoding would not be an easy task.

In the syntax file, CharStart, CharEnd and HTML=1 are correctly defined (see the post from AmigoJack, June 10, 2020).

O. K., it looks like I have to live with the restriction (it's my opinion that this is a restriction).

Post by **AmigoJack** » Fri Mar 19, 2021 1:22 pm

But different dashes can be used directly: what your internet browser renders are actual characters which could be copied and used instead of the entity you use. And for the editor those characters (of course) will remain as those (and not be converted for magical reasons to hyphens).

That is the whole point - if the editor does not give you a good way to display zero-width joiners then look for a better editor - that is the same argument as not using tab characters because they always look like spaces.

The only real need for entities is to express text that must not collide with the markup (i.e. &lt; for <).
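To make that concrete, here is a short Python sketch using the standard library's `html.escape`, which replaces only the characters that would collide with markup and leaves everything else alone. The wrapper name `escape_markup` is made up for this example.

```python
import html


def escape_markup(text: str) -> str:
    """Escape only the characters that would collide with HTML markup."""
    return html.escape(text, quote=True)


print(escape_markup('1 < 2 & "x"'))      # markup characters become entities
print(escape_markup("a \u2014 b"))       # the em dash is left as a plain character
print(html.unescape("&mdash;") == "\u2014")  # the named entity is the same character
```

Dashes, accented letters and other typographic characters need no escaping as long as the file is saved in an encoding that the consumer reads correctly.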
